Building and better understanding vision-language models: insights and future directions
August 22, 2024
作者: Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
cs.AI
Abstract
The field of vision-language models (VLMs), which take images and texts as
inputs and output texts, is rapidly evolving and has yet to reach consensus on
several key aspects of the development pipeline, including data, architecture,
and training methods. This paper can be seen as a tutorial for building a VLM.
We begin by providing a comprehensive overview of the current state-of-the-art
approaches, highlighting the strengths and weaknesses of each, addressing the
major challenges in the field, and suggesting promising research directions for
underexplored areas. We then walk through the practical steps to build
Idefics3-8B, a powerful VLM that significantly outperforms its predecessor
Idefics2-8B, while being trained efficiently, exclusively on open datasets, and
using a straightforward pipeline. These steps include the creation of Docmatix,
a dataset for improving document understanding capabilities, which is 240 times
larger than previously available datasets. We release the model along with the
datasets created for its training.
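
Since the abstract states that the model and its training datasets are publicly released, the sketch below shows one plausible way to load them with the Hugging Face `transformers` and `datasets` libraries. The repository ids `HuggingFaceM4/Idefics3-8B-Llama3` and `HuggingFaceM4/Docmatix` are assumptions for illustration; the abstract itself does not name them.

```python
# A minimal sketch of loading the released artifacts. The repo ids below
# are assumptions for illustration; the abstract does not name them.
import torch
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "HuggingFaceM4/Idefics3-8B-Llama3"  # assumed checkpoint id

# Load the processor (joint image + text preprocessing) and model weights.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision to fit an 8B model in memory
    device_map="auto",
)

# Stream Docmatix, the document-understanding dataset mentioned in the
# abstract, rather than downloading it in full.
docmatix = load_dataset("HuggingFaceM4/Docmatix", split="train", streaming=True)
first_example = next(iter(docmatix))
print(first_example.keys())
```

Streaming is used here because the abstract describes Docmatix as 240 times larger than previously available datasets, so a full download is likely impractical for a first look.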