Unicorn: Text-Only Data Synthesis for Vision Language Model Training

Abstract

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To address this, we propose a cross-integrated three-stage multimodal data synthesis framework that generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, the textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.
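
To make Stage 3 more concrete, the following is a minimal sketch of one way a text-to-visual representation transfer could look. It is not taken from the abstract: it assumes that caption embeddings come from an encoder with a joint text-image space and that the two modalities differ roughly by a constant offset vector. The names `mean_vector`, `transfer_to_visual`, and `gap`, and the choice of estimating `gap` from the mean caption embedding, are all hypothetical and used only for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def transfer_to_visual(text_embs: List[Vector], gap: Vector) -> List[Vector]:
    """Hypothetical transfer step: subtract an assumed modality offset
    from each caption embedding, then renormalize to unit length."""
    shifted = [[x - g for x, g in zip(emb, gap)] for emb in text_embs]
    return [l2_normalize(v) for v in shifted]


if __name__ == "__main__":
    # Toy caption embeddings standing in for real encoder outputs.
    captions = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.7, 0.2, 0.4]]
    # Assumption: the offset is approximated by the mean caption embedding.
    gap = mean_vector(captions)
    for vec in transfer_to_visual(captions, gap):
        print([round(x, 3) for x in vec])
```

In this sketch the resulting vectors would play the role of synthetic image representations paired with the original captions. How the offset is actually estimated, and whether any further processing is applied, is not specified in the abstract.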
