Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
July 9, 2025
Authors: Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
cs.AI
Abstract
Building state-of-the-art Vision-Language Models (VLMs) with strong
captioning capabilities typically necessitates training on billions of
high-quality image-text pairs, requiring millions of GPU hours. This paper
introduces the Vision-Language-Vision (VLV) auto-encoder framework, which
strategically leverages key pretrained components: a vision encoder, the
decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large
Language Model (LLM). Specifically, we establish an information bottleneck by
regularizing the language representation space, achieved through freezing the
pretrained T2I diffusion decoder. Our VLV pipeline effectively distills
knowledge from the text-conditioned diffusion model using continuous
embeddings, demonstrating comprehensive semantic understanding via high-quality
reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the
intermediate language representations into detailed descriptions, we construct
a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o
and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and
significantly reduces data requirements; by primarily utilizing single-modal
images for training and maximizing the utility of existing pretrained models
(image encoder, T2I diffusion model, and LLM), it circumvents the need for
massive paired image-text datasets, keeping the total training expenditure
under $1,000 USD.
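
To make the first stage of the pipeline described above concrete, here is a minimal PyTorch sketch: a pretrained vision encoder is projected into a fixed-length sequence of continuous "caption" embeddings, which condition a frozen text-to-image diffusion decoder, so only the projection (the information bottleneck) receives gradients from the denoising loss. All module names, dimensions, the toy noising schedule, and the stand-in encoder/decoder below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLVBottleneck(nn.Module):
    """Projects image features into a fixed-length sequence of continuous
    'caption' embeddings that stand in for text conditioning."""
    def __init__(self, vision_encoder, feat_dim, embed_dim=768, num_tokens=77):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.proj = nn.Linear(feat_dim, num_tokens * embed_dim)
        self.num_tokens, self.embed_dim = num_tokens, embed_dim

    def forward(self, images):
        feats = self.vision_encoder(images)                      # (B, feat_dim)
        return self.proj(feats).view(-1, self.num_tokens, self.embed_dim)

def vlv_training_step(bottleneck, frozen_t2i_decoder, images, latents):
    """One distillation step: the frozen T2I decoder must denoise image latents
    conditioned only on the continuous embeddings, so the denoising loss pushes
    those embeddings to carry the image's semantics."""
    cond = bottleneck(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),), device=latents.device)
    noisy = latents + noise * (t.float().view(-1, 1, 1, 1) / 1000.0)  # toy noising schedule
    # Decoder weights are frozen elsewhere via requires_grad_(False); gradients
    # still flow through `cond` back into the bottleneck.
    pred = frozen_t2i_decoder(noisy, t, cond)
    return F.mse_loss(pred, noise)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end on random data.
    toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
    toy_decoder = lambda noisy, t, cond: noisy + cond.mean(dim=(1, 2)).view(-1, 1, 1, 1)
    bottleneck = VLVBottleneck(toy_encoder, feat_dim=512)
    images = torch.randn(2, 3, 64, 64)
    latents = torch.randn(2, 4, 8, 8)
    loss = vlv_training_step(bottleneck, toy_decoder, images, latents)
    loss.backward()
    print(f"toy denoising loss: {loss.item():.4f}")
```

The second stage (not shown) would then fine-tune a pretrained LLM to decode these intermediate continuous embeddings into detailed natural-language captions, as described in the abstract.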