Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
July 9, 2025
Authors: Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
cs.AI
Abstract
Building state-of-the-art Vision-Language Models (VLMs) with strong
captioning capabilities typically necessitates training on billions of
high-quality image-text pairs, requiring millions of GPU hours. This paper
introduces the Vision-Language-Vision (VLV) auto-encoder framework, which
strategically leverages key pretrained components: a vision encoder, the
decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large
Language Model (LLM). Specifically, we establish an information bottleneck by
regularizing the language representation space, achieved through freezing the
pretrained T2I diffusion decoder. Our VLV pipeline effectively distills
knowledge from the text-conditioned diffusion model using continuous
embeddings, demonstrating comprehensive semantic understanding via high-quality
reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the
intermediate language representations into detailed descriptions, we construct
a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o
and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and
significantly reduces data requirements; by primarily utilizing single-modal
images for training and maximizing the utility of existing pretrained models
(image encoder, T2I diffusion model, and LLM), it circumvents the need for
massive paired image-text datasets, keeping the total training expenditure
under $1,000 USD.
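
To make the first stage of the pipeline described above concrete, here is a minimal PyTorch sketch: a pretrained vision encoder is projected into a fixed-length sequence of continuous "caption" embeddings, which condition a frozen text-to-image diffusion decoder, so only the projection (the information bottleneck) receives gradients from the denoising loss. All module names, dimensions, the toy noising schedule, and the stand-in encoder/decoder below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLVBottleneck(nn.Module):
    """Projects image features into a fixed-length sequence of continuous
    'caption' embeddings that stand in for text conditioning."""
    def __init__(self, vision_encoder, feat_dim, embed_dim=768, num_tokens=77):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.proj = nn.Linear(feat_dim, num_tokens * embed_dim)
        self.num_tokens, self.embed_dim = num_tokens, embed_dim

    def forward(self, images):
        feats = self.vision_encoder(images)                      # (B, feat_dim)
        return self.proj(feats).view(-1, self.num_tokens, self.embed_dim)

def vlv_training_step(bottleneck, frozen_t2i_decoder, images, latents):
    """One distillation step: the frozen T2I decoder must denoise image latents
    conditioned only on the continuous embeddings, so the denoising loss pushes
    those embeddings to carry the image's semantics."""
    cond = bottleneck(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),), device=latents.device)
    noisy = latents + noise * (t.float().view(-1, 1, 1, 1) / 1000.0)  # toy noising schedule
    # Decoder weights are frozen elsewhere via requires_grad_(False); gradients
    # still flow through `cond` back into the bottleneck.
    pred = frozen_t2i_decoder(noisy, t, cond)
    return F.mse_loss(pred, noise)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end on random data.
    toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
    toy_decoder = lambda noisy, t, cond: noisy + cond.mean(dim=(1, 2)).view(-1, 1, 1, 1)
    bottleneck = VLVBottleneck(toy_encoder, feat_dim=512)
    images = torch.randn(2, 3, 64, 64)
    latents = torch.randn(2, 4, 8, 8)
    loss = vlv_training_step(bottleneck, toy_decoder, images, latents)
    loss.backward()
    print(f"toy denoising loss: {loss.item():.4f}")
```

The second stage (not shown) would then fine-tune a pretrained LLM to decode these intermediate continuous embeddings into detailed natural-language captions, as described in the abstract.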