

Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

March 12, 2024
作者: Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino
cs.AI

Abstract

The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings starting from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that the VLM trained with synthetic data exhibits image-captioning performance comparable to models trained solely on human-annotated data, while requiring only a fraction of that data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, all with improved data efficiency and resource utilization.
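The abstract describes a three-stage pipeline: an LLM generates captions, a pretrained text-to-image model turns those captions into image embeddings (skipping pixel-space rendering), and the resulting synthetic pairs are used to train a VLM. The following is a minimal PyTorch sketch of that data flow only; all names (`generate_captions`, `TextToImageEmbedder`, `TinyVLM`, the toy tokenizer) are hypothetical stand-ins, not the authors' code or any real model API.

```python
# Minimal sketch of the synthetic-pair pipeline: LLM captions -> image
# embeddings -> VLM training. All components below are toy stand-ins.
import torch
import torch.nn as nn

def generate_captions(n: int) -> list[str]:
    # Stand-in for LLM caption generation.
    return [f"a photo of object {i}" for i in range(n)]

def tokenize(captions, vocab_size=10_000, max_len=8):
    # Toy hash-based tokenizer so the sketch stays self-contained.
    ids = [[hash(w) % vocab_size for w in c.split()][:max_len] for c in captions]
    return torch.tensor([row + [0] * (max_len - len(row)) for row in ids])

class TextToImageEmbedder(nn.Module):
    # Stand-in for a pretrained text-to-image model whose intermediate image
    # embeddings are used directly, instead of rendering pixels.
    def __init__(self, vocab_size=10_000, dim=256):
        super().__init__()
        self.token_emb = nn.EmbeddingBag(vocab_size, dim)  # mean-pools caption tokens
        self.to_image_emb = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.to_image_emb(self.token_emb(token_ids))

class TinyVLM(nn.Module):
    # Stand-in VLM head trained on (image embedding, caption) pairs.
    def __init__(self, dim=256, vocab_size=10_000):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, image_emb):
        return self.proj(image_emb)

captions = generate_captions(32)                 # 1) LLM generates captions
tokens = tokenize(captions)
with torch.no_grad():
    image_embs = TextToImageEmbedder()(tokens)   # 2) synthesize image embeddings

vlm = TinyVLM()
opt = torch.optim.Adam(vlm.parameters(), lr=1e-3)
targets = tokens[:, 0]                           # toy target: first caption token
for _ in range(3):                               # 3) train the VLM on synthetic pairs
    loss = nn.functional.cross_entropy(vlm(image_embs), targets)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"toy training loss: {loss.item():.3f}")
```

The sketch only illustrates why embedding-space synthesis can be cheaper: the text-to-image stand-in emits an embedding directly, so no image decoding or re-encoding step appears between caption generation and VLM training.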
