Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
March 12, 2024
作者: Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino
cs.AI
Abstract
The creation of high-quality human-labeled image-caption datasets presents a
significant bottleneck in the development of Visual-Language Models (VLMs). We
propose a novel approach that leverages the strengths of Large Language Models
(LLMs) and image generation models to create synthetic image-text pairs for
efficient and effective VLM training. Our method employs a pretrained
text-to-image model to synthesize image embeddings starting from captions
generated by an LLM. These synthetic pairs are then used to train a VLM.
Extensive experiments demonstrate that the VLM trained with synthetic data
exhibits comparable performance on image captioning, while requiring a fraction
of the data used by models trained solely on human-annotated data. In
particular, we outperform the baseline by 17% through augmentation with a
synthetic dataset. Furthermore, we show that synthesizing in the image
embedding space is 25% faster than in the pixel space. This research introduces
a promising technique for generating large-scale, customizable image datasets,
leading to enhanced VLM performance and wider applicability across various
domains, all with improved data efficiency and resource utilization.
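The abstract describes a three-stage pipeline: an LLM generates captions, a pretrained text-to-image model maps those captions to image embeddings (stopping before pixel-space decoding, which is what makes synthesis faster), and the resulting synthetic pairs augment VLM training. The sketch below is a minimal, hypothetical illustration of that data flow using toy placeholder functions and stdlib-only code; it is an assumption about the structure, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code) of a Synth^2-style
# synthetic-data pipeline: LLM captions -> text-to-image embeddings -> pairs
# that augment VLM training. All function names and values are placeholders.

import random
from dataclasses import dataclass

EMBED_DIM = 16  # toy embedding size for illustration


@dataclass
class SyntheticPair:
    caption: str
    image_embedding: list[float]


def generate_captions_with_llm(class_names: list[str], n: int) -> list[str]:
    """Stand-in for prompting an LLM to write diverse captions."""
    templates = ["a photo of a {}", "a close-up of a {}", "a {} in the wild"]
    return [random.choice(templates).format(random.choice(class_names))
            for _ in range(n)]


def text_to_image_embedding(caption: str) -> list[float]:
    """Stand-in for a pretrained text-to-image generator stopped at the
    image-embedding stage (no pixel decoding), the step the paper reports
    as cheaper than pixel-space synthesis."""
    rng = random.Random(caption)  # deterministic per caption, for illustration
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def build_synthetic_dataset(class_names: list[str], n: int) -> list[SyntheticPair]:
    """Assemble (caption, image-embedding) pairs to mix into VLM training."""
    captions = generate_captions_with_llm(class_names, n)
    return [SyntheticPair(c, text_to_image_embedding(c)) for c in captions]


if __name__ == "__main__":
    synthetic = build_synthetic_dataset(["dog", "bicycle", "volcano"], n=5)
    # In the paper's setup, pairs like these would be combined with
    # human-annotated data when training the VLM's captioning objective.
    for pair in synthetic:
        print(pair.caption, pair.image_embedding[:3])
```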