Zero-Shot Vision Encoder Grafting via LLM Surrogates
May 28, 2025
作者: Kaiyu Yue, Vasu Singla, Menglin Jia, John Kirchenbauer, Rifaa Qadri, Zikui Cai, Abhinav Bhatele, Furong Huang, Tom Goldstein
cs.AI
Abstract
Vision language models (VLMs) typically pair a modestly sized vision encoder
with a large language model (LLM), e.g., Llama-70B, making the decoder the
primary computational burden during training. To reduce costs, a potentially
promising strategy is to first train the vision encoder using a small language
model before transferring it to the large one. We construct small "surrogate
models" that share the same embedding space and representation language as the
large target LLM by directly inheriting its shallow layers. Vision encoders
trained on the surrogate can then be directly transferred to the larger model,
a process we call zero-shot grafting -- when plugged directly into the
full-size target LLM, the grafted pair surpasses the encoder-surrogate pair
and, on some benchmarks, even performs on par with full decoder training with
the target LLM. Furthermore, our surrogate training approach reduces overall
VLM training costs by ~45% when using Llama-70B as the decoder.
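
The surrogate construction described above can be illustrated with a minimal sketch (not the authors' released code). It assumes the surrogate inherits the target LLM's token embeddings and its first k decoder layers, and, as further assumptions for illustration, reuses the target's final norm and LM head and keeps the inherited weights frozen while the vision encoder is trained against it. The model name and layer count are hypothetical choices.

```python
# Minimal sketch: derive a small "surrogate" decoder from a large target LLM
# by keeping its shallow layers, so a vision encoder trained against the
# surrogate emits features aligned with the target's embedding space.
import torch
from transformers import AutoModelForCausalLM

TARGET_NAME = "meta-llama/Llama-3.1-70B"  # hypothetical target LLM
NUM_SHALLOW_LAYERS = 8                    # assumed surrogate depth

def build_surrogate(target_name: str = TARGET_NAME, k: int = NUM_SHALLOW_LAYERS):
    # Load the full target decoder; embeddings, shallow layers, final norm,
    # and LM head are all inherited from it.
    model = AutoModelForCausalLM.from_pretrained(
        target_name, torch_dtype=torch.bfloat16
    )
    # Truncate the decoder stack to its first k (shallow) layers.
    model.model.layers = model.model.layers[:k]
    model.config.num_hidden_layers = k

    # Freeze inherited weights so encoder training cannot drift away from the
    # target LLM's representation space (an assumption in this sketch).
    for p in model.parameters():
        p.requires_grad_(False)
    return model

if __name__ == "__main__":
    surrogate = build_surrogate()
    print(surrogate.config.num_hidden_layers)  # -> 8
```

Because the surrogate shares the target's embedding space, an encoder (plus projector) trained against it can, per the abstract, be plugged directly into the full-size target LLM without further tuning, which is what the paper terms zero-shot grafting.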