Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Abstract

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically requires training on billions of high-quality image-text pairs and millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, which we achieve by freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method is highly cost-efficient and significantly reduces data requirements: by primarily using single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets and keeps the total training expenditure under $1,000 USD.
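
The frozen-decoder idea can be illustrated with a deliberately tiny sketch. This is not the VLV architecture: the "decoder" here is a fixed linear map standing in for the frozen T2I diffusion decoder, the "encoder" is a trainable linear map standing in for the vision encoder, and all names and values (`DECODER`, `encode`, `decode`, `train_encoder`, the learning rate) are hypothetical choices for illustration. It only shows the training pattern the abstract describes: unpaired inputs are compressed to a continuous embedding, reconstructed by a decoder whose weights never change, and the reconstruction error updates the encoder alone.

```python
# Toy illustration only: linear stand-ins, not the actual VLV models.

DECODER = (0.6, 0.8)  # frozen "decoder" weights (hypothetical); never updated


def encode(enc, x):
    """Map a 2-D input to a 1-D continuous embedding (the bottleneck)."""
    return enc[0] * x[0] + enc[1] * x[1]


def decode(z):
    """Frozen decoder: reconstruct the 2-D input from the embedding."""
    return (DECODER[0] * z, DECODER[1] * z)


def reconstruction_loss(enc, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(enc, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train_encoder(data, steps=200, lr=0.1):
    """Gradient descent on the encoder only; the decoder stays fixed."""
    enc = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in data:
            x_hat = decode(encode(enc, x))
            # d(loss)/dz for this sample, propagated back through the frozen decoder
            dz = 2.0 * ((x_hat[0] - x[0]) * DECODER[0]
                        + (x_hat[1] - x[1]) * DECODER[1])
            grad[0] += dz * x[0] / len(data)
            grad[1] += dz * x[1] / len(data)
        enc[0] -= lr * grad[0]
        enc[1] -= lr * grad[1]
    return enc


# Unpaired "images": points that the frozen decoder is able to reproduce.
data = [(0.6 * t, 0.8 * t) for t in (-1.0, -0.5, 0.5, 1.0)]
loss_before = reconstruction_loss([0.0, 0.0], data)
encoder = train_encoder(data)
loss_after = reconstruction_loss(encoder, data)
```

In this toy setting the encoder has no choice but to produce embeddings the fixed decoder already understands, which is the sense in which a frozen decoder regularizes the embedding space. No paired labels are involved at any point.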
