ChatPaper.ai

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

March 28, 2025
作者: Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang
cs.AI

Abstract

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.
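The three-stage pipeline described above could be sketched as follows. This is a minimal illustrative sketch only: the function names, the toy "LLM", and the toy text encoder are all hypothetical stand-ins, not the authors' implementation (which is in the linked repository).

```python
# Hypothetical sketch of the three-stage text-only synthesis pipeline.
# Toy callables stand in for a real LLM and a real text encoder.

def stage1_expand_captions(seed_captions, llm):
    """Stage 1: expand sparse caption seeds into diverse captions via an LLM."""
    return [llm(f"Rewrite with more detail: {seed}") for seed in seed_captions]

def stage2_build_instructions(captions):
    """Stage 2: turn captions into multi-turn instruction-tuning samples."""
    return [
        {"turns": [
            {"role": "user", "content": "Describe the image."},
            {"role": "assistant", "content": cap},
        ]}
        for cap in captions
    ]

def stage3_transfer_modality(captions, text_encoder):
    """Stage 3: map textual caption representations into the visual
    embedding space, yielding synthetic image representations."""
    return [text_encoder(cap) for cap in captions]

# Toy stand-ins: a "LLM" that uppercases its prompt, and an "encoder"
# that maps a caption to per-word length features.
toy_llm = lambda prompt: prompt.upper()
toy_encoder = lambda text: [float(len(w)) for w in text.split()]

seeds = ["a cat on a mat", "a red bicycle"]
captions = stage1_expand_captions(seeds, toy_llm)
instructions = stage2_build_instructions(captions)
embeddings = stage3_transfer_modality(captions, toy_encoder)
```

In the paper's setting, Stage 1 and Stage 2 operate purely on text, and only Stage 3 bridges modalities, which is what removes the need for real images during data construction.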

