Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Abstract

Creating high-quality, human-labeled image-caption datasets is a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that a VLM trained with synthetic data achieves comparable performance on image captioning while requiring a fraction of the data used by models trained solely on human-annotated data. In particular, augmenting training with a synthetic dataset outperforms the baseline by 17%. Furthermore, we show that synthesizing in the image embedding space is 25% faster than synthesizing in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, with improved data efficiency and resource utilization.
