Building and better understanding vision-language models: insights and future directions
August 22, 2024
Authors: Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
cs.AI
Abstract
The field of vision-language models (VLMs), which take images and texts as
inputs and output texts, is rapidly evolving and has yet to reach consensus on
several key aspects of the development pipeline, including data, architecture,
and training methods. This paper can be seen as a tutorial for building a VLM.
We begin by providing a comprehensive overview of the current state-of-the-art
approaches, highlighting the strengths and weaknesses of each, addressing the
major challenges in the field, and suggesting promising research directions for
underexplored areas. We then walk through the practical steps to build
Idefics3-8B, a powerful VLM that significantly outperforms its predecessor
Idefics2-8B, while being trained efficiently, exclusively on open datasets, and
using a straightforward pipeline. These steps include the creation of Docmatix,
a dataset for improving document understanding capabilities, which is 240 times
larger than previously available datasets. We release the model along with the
datasets created for its training.
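As a quick orientation beyond the abstract, the sketch below shows one plausible way to run the released checkpoint with Hugging Face transformers, using the standard vision-to-text API (AutoProcessor plus AutoModelForVision2Seq). The model id "HuggingFaceM4/Idefics3-8B-Llama3", the chat-template call, the prompt, and the example image path are assumptions for illustration, not details stated in the abstract.

```python
# Minimal inference sketch (illustrative; model id and prompt are assumptions).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit an 8B model on one GPU
    device_map="auto",           # requires the `accelerate` package
)

# VLMs like Idefics3 take images and text as input and output text.
image = Image.open("document_page.png")  # any local image file
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this document about?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same processor/model pattern applies when fine-tuning on the released Docmatix data; only the data-loading and training loop differ.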