ChatPaper.ai

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

March 28, 2025
作者: Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang
cs.AI

Abstract

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.
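The three-stage pipeline described above could be sketched as follows. This is a minimal illustrative sketch only: the function names, the toy "LLM", and the toy text encoder are all hypothetical stand-ins, not the authors' implementation (which is in the linked repository).

```python
# Hypothetical sketch of the three-stage text-only synthesis pipeline.
# Toy callables stand in for a real LLM and a real text encoder.

def stage1_expand_captions(seed_captions, llm):
    """Stage 1: expand sparse caption seeds into diverse captions via an LLM."""
    return [llm(f"Rewrite with more detail: {seed}") for seed in seed_captions]

def stage2_build_instructions(captions):
    """Stage 2: turn captions into multi-turn instruction-tuning samples."""
    return [
        {"turns": [
            {"role": "user", "content": "Describe the image."},
            {"role": "assistant", "content": cap},
        ]}
        for cap in captions
    ]

def stage3_transfer_modality(captions, text_encoder):
    """Stage 3: map textual caption representations into the visual
    embedding space, yielding synthetic image representations."""
    return [text_encoder(cap) for cap in captions]

# Toy stand-ins: a "LLM" that uppercases its prompt, and an "encoder"
# that maps a caption to per-word length features.
toy_llm = lambda prompt: prompt.upper()
toy_encoder = lambda text: [float(len(w)) for w in text.split()]

seeds = ["a cat on a mat", "a red bicycle"]
captions = stage1_expand_captions(seeds, toy_llm)
instructions = stage2_build_instructions(captions)
embeddings = stage3_transfer_modality(captions, toy_encoder)
```

In the paper's setting, Stage 1 and Stage 2 operate purely on text, and only Stage 3 bridges modalities, which is what removes the need for real images during data construction.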

