ユニコーン：視覚言語モデル訓練のためのテキストのみのデータ合成

要旨

視覚言語モデル（VLM）の訓練には通常、大規模で高品質な画像-テキストペアが必要ですが、そのようなデータを収集または合成するにはコストがかかります。一方、テキストデータは豊富で安価であり、以下の疑問が生じます：高品質なマルチモーダル訓練データをテキストのみから合成できるか？この課題に取り組むため、我々はクロス統合型の3段階マルチモーダルデータ合成フレームワークを提案し、Unicorn-1.2MとUnicorn-471K-Instructionという2つのデータセットを生成します。第1段階：多様なキャプションデータ合成では、大規模言語モデル（LLM）を使用してスパースなキャプションシードを拡張し、120万の意味的に多様な高品質キャプションを構築します。第2段階：指示チューニングデータ生成では、47万1千のキャプションをさらに処理し、複雑な推論をサポートする多ターン指示チューニングタスクに変換します。最後に、第3段階：モダリティ表現変換では、これらのテキストキャプション表現を視覚表現に変換し、多様な合成画像表現を生成します。この3段階プロセスにより、実画像に依存せずに、事前訓練用のUnicorn-1.2Mと指示チューニング用のUnicorn-471K-Instructionを構築できます。実画像への依存を排除しながらデータの品質と多様性を維持することで、我々のフレームワークはVLM訓練のためのコスト効率が高くスケーラブルなソリューションを提供します。コードはhttps://github.com/Yu-xm/Unicorn.gitで公開されています。

English

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual captions representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLMs training. Code is available at https://github.com/Yu-xm/Unicorn.git.

ユニコーン：視覚言語モデル訓練のためのテキストのみのデータ合成

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

要旨

Support