Building and better understanding vision-language models: insights and future directions

Abstract

The field of vision-language models (VLMs), which take images and text as input and produce text as output, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

