Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

July 9, 2025
Authors: Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
cs.AI

Abstract

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
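The two-stage recipe described in the abstract can be summarized with a minimal PyTorch sketch. Everything below is an illustrative stand-in rather than the authors' code: the tiny VisionEncoder, FrozenDiffusionDecoder, and GRU caption head are hypothetical placeholders for the pretrained vision encoder, the frozen T2I diffusion decoder, and the fine-tuned LLM. Only the structure mirrors the described pipeline: a trainable encoder feeding a frozen diffusion decoder (the information bottleneck) trained with a denoising loss on images alone, followed by a caption decoder trained on the continuous embeddings.

```python
# Minimal sketch of the Vision-Language-Vision (VLV) auto-encoder idea.
# All modules are toy stand-ins, NOT the authors' released code: the paper uses a
# pretrained, frozen text-to-image diffusion model as the decoder and fine-tunes a
# pretrained LLM as the captioner; placeholders keep this sketch self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_TOKENS = 64, 8  # size of the continuous "language" bottleneck (illustrative)

class VisionEncoder(nn.Module):
    """Trainable encoder: image -> continuous caption-like embeddings (the bottleneck)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 4, 4), nn.GELU(),
            nn.Conv2d(16, 32, 4, 4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_tokens = nn.Linear(32, NUM_TOKENS * EMB_DIM)

    def forward(self, images):
        z = self.to_tokens(self.backbone(images))
        return z.view(images.size(0), NUM_TOKENS, EMB_DIM)

class FrozenDiffusionDecoder(nn.Module):
    """Stand-in for the frozen T2I diffusion decoder: predicts noise from a noisy
    image given the conditioning embeddings; its weights are never updated."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(EMB_DIM, 3)
        self.unet = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for a real U-Net

    def forward(self, noisy_images, cond):
        c = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.unet(noisy_images + c)

encoder, decoder = VisionEncoder(), FrozenDiffusionDecoder()
decoder.requires_grad_(False)                      # frozen decoder regularizes the bottleneck
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Stage 1: distill the frozen T2I decoder into the encoder using images only.
images = torch.randn(4, 3, 64, 64)                 # unpaired images; no captions needed
noise = torch.randn_like(images)
noisy = images + noise                             # toy forward-diffusion step
cond = encoder(images)                             # continuous language-space embeddings
loss = F.mse_loss(decoder(noisy, cond), noise)     # denoising (epsilon-prediction) loss
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: fine-tune a caption decoder (LLM stand-in) on the frozen embeddings.
VOCAB = 1000
caption_decoder = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)  # placeholder for the pretrained LLM
head = nn.Linear(EMB_DIM, VOCAB)
targets = torch.randint(0, VOCAB, (4, NUM_TOKENS))            # toy caption tokens
with torch.no_grad():
    cond = encoder(images)                         # embeddings act as "soft captions"
hidden, _ = caption_decoder(cond)
caption_loss = F.cross_entropy(head(hidden).reshape(-1, VOCAB), targets.reshape(-1))
# In practice the LLM would be optimized on real caption targets; omitted for brevity.
print(f"stage-1 loss {loss.item():.3f} | stage-2 loss {caption_loss.item():.3f}")
```

The key design point the sketch tries to capture is that gradients in stage 1 flow through the frozen diffusion decoder into the encoder, so the only way to reconstruct the image well is to pack its semantics into the compact continuous embeddings, which the captioner then verbalizes in stage 2.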