Zero-Shot Vision Encoder Grafting via LLM Surrogates
May 28, 2025
作者: Kaiyu Yue, Vasu Singla, Menglin Jia, John Kirchenbauer, Rifaa Qadri, Zikui Cai, Abhinav Bhatele, Furong Huang, Tom Goldstein
cs.AI
Abstract
Vision language models (VLMs) typically pair a modestly sized vision encoder
with a large language model (LLM), e.g., Llama-70B, making the decoder the
primary computational burden during training. To reduce costs, a potentially
promising strategy is to first train the vision encoder using a small language
model before transferring it to the large one. We construct small "surrogate
models" that share the same embedding space and representation language as the
large target LLM by directly inheriting its shallow layers. Vision encoders
trained on the surrogate can then be directly transferred to the larger model,
a process we call zero-shot grafting -- when plugged directly into the
full-size target LLM, the grafted pair surpasses the encoder-surrogate pair
and, on some benchmarks, even performs on par with full decoder training with
the target LLM. Furthermore, our surrogate training approach reduces overall
VLM training costs by ~45% when using Llama-70B as the decoder.
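To make the described pipeline concrete, below is a minimal sketch of how a surrogate could be built by inheriting the target LLM's shallow layers, and how a vision encoder trained against that surrogate could be zero-shot grafted onto the full model. It assumes a Hugging Face Llama-style decoder and a LLaVA-style prefix interface; names such as build_surrogate, projector, and num_shallow_layers are illustrative and not taken from the paper.

```python
# Illustrative sketch only (not the paper's released code): build a small
# "surrogate" decoder by inheriting the shallow layers of a large target LLM,
# so a vision encoder trained against the surrogate already lives in the
# target model's embedding space. Module paths (model.layers,
# get_input_embeddings) follow the Hugging Face Llama layout.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


def build_surrogate(target_name: str, num_shallow_layers: int = 8):
    """Keep only the embedding table and the first few transformer blocks of
    the target LLM, so the surrogate shares its input embedding space."""
    surrogate = AutoModelForCausalLM.from_pretrained(
        target_name, torch_dtype=torch.bfloat16
    )
    surrogate.model.layers = nn.ModuleList(
        surrogate.model.layers[:num_shallow_layers]
    )
    surrogate.config.num_hidden_layers = num_shallow_layers
    return surrogate


def vlm_loss(decoder, vision_encoder, projector, pixel_values, input_ids, labels):
    """LLaVA-style forward pass: projected image features are prepended to the
    text embeddings. During training, `decoder` is the small surrogate and only
    the vision encoder/projector are optimized; for zero-shot grafting, the
    same trained encoder and projector are paired with the full-size target
    LLM instead, with no further tuning."""
    image_feats = projector(vision_encoder(pixel_values))        # (B, T_img, d)
    text_embeds = decoder.get_input_embeddings()(input_ids)      # (B, T_txt, d)
    inputs_embeds = torch.cat([image_feats, text_embeds], dim=1)
    # Mask image positions with -100 so only text tokens contribute to the loss.
    ignore = torch.full(image_feats.shape[:2], -100,
                        dtype=labels.dtype, device=labels.device)
    out = decoder(inputs_embeds=inputs_embeds,
                  labels=torch.cat([ignore, labels], dim=1))
    return out.loss
```

Because the surrogate reuses the target LLM's own embeddings and shallow layers rather than a separately trained small model, the encoder's output features remain directly consumable by the full-size decoder, which is what makes the zero-shot swap possible in this sketch.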