Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Abstract

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
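
The code-guided idea described above can be illustrated with a minimal Python sketch. It is not the CoSyn implementation: every function name here is hypothetical, the LLM calls are replaced by fixed stubs, and the rendering step is left as a comment. The point it tries to show is that question-answer pairs can be derived from the rendering code alone, without ever looking at pixels.

```python
import re
from html import escape, unescape


def stub_llm_generate_data(domain):
    """Hypothetical stand-in for a text-only LLM inventing content for a domain."""
    # Fixed output for illustration; a real pipeline would sample diverse content.
    return {
        "title": "Nutrition Facts",
        "rows": [("Calories", "230"), ("Total Fat", "8 g"), ("Sodium", "160 mg")],
    }


def stub_llm_generate_code(data):
    """Hypothetical stand-in for the LLM writing rendering code (HTML here)."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{escape(value)}</td></tr>"
        for name, value in data["rows"]
    )
    return f"<table><caption>{escape(data['title'])}</caption>{rows}</table>"


def stub_llm_generate_qa(code):
    """Hypothetical stand-in for the LLM reading the code to write Q&A pairs.

    A regex over the HTML plays the role of the LLM "reading" the code.
    """
    pairs = re.findall(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>", code)
    return [
        {"question": f"What is the value of {unescape(name)}?", "answer": unescape(value)}
        for name, value in pairs
    ]


def synthesize_example(domain):
    """Run the three stubbed stages: content, rendering code, instruction data."""
    data = stub_llm_generate_data(domain)
    code = stub_llm_generate_code(data)
    # A real pipeline would render `code` to an image at this point.
    qa = stub_llm_generate_qa(code)
    return {"domain": domain, "code": code, "qa": qa}


if __name__ == "__main__":
    example = synthesize_example("nutrition fact labels")
    for pair in example["qa"]:
        print(pair["question"], "->", pair["answer"])
```

In this sketch the image never enters the question-generation step; the HTML string is the only input, which mirrors the abstract's claim that the underlying code serves as a textual representation of the synthetic image.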

