Zero-Shot Vision Encoder Grafting via LLM Surrogates

Abstract

Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM) such as Llama-70B, which makes the decoder the primary computational burden during training. To reduce costs, one promising strategy is to first train the vision encoder with a small language model and then transfer it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be transferred directly to the larger model, a process we call zero-shot grafting. When plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
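The layer-inheritance idea can be illustrated with a toy sketch. This is not the authors' implementation: the names `Decoder`, `make_toy_layer`, `build_surrogate`, and `toy_vision_encoder` are hypothetical, the "layers" are trivial functions standing in for transformer blocks, no training is performed, and the choice of two shallow layers out of eight is arbitrary. The abstract does not say how many layers are inherited or whether anything replaces the omitted deep layers, so the sketch simply truncates the stack.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Toy stand-in for a transformer block: a residual-style update."""
    def layer(h: Vector) -> Vector:
        return [x + scale * x for x in h]
    return layer


@dataclass
class Decoder:
    """A decoder reduced to an ordered stack of layers (assumption)."""
    layers: List[Layer]

    def forward(self, embeddings: Vector) -> Vector:
        h = embeddings
        for layer in self.layers:
            h = layer(h)
        return h


def build_surrogate(target: Decoder, num_shallow: int) -> Decoder:
    """Reuse the first `num_shallow` layer objects of the target.

    The surrogate holds the very same layer objects, so it reads inputs
    in the same embedding space as the target's shallow layers.
    """
    return Decoder(layers=target.layers[:num_shallow])


def toy_vision_encoder(pixels: List[int]) -> Vector:
    """Hypothetical encoder mapping pixels to decoder input embeddings."""
    return [p / 255.0 for p in pixels]


# Hypothetical sizes: an 8-layer target and a 2-layer surrogate.
target = Decoder(layers=[make_toy_layer(0.1) for _ in range(8)])
surrogate = build_surrogate(target, num_shallow=2)

image = [0, 128, 255]
visual_embeddings = toy_vision_encoder(image)

# In the paper's setting the encoder would be trained against the surrogate;
# here we only show that the same encoder output feeds either decoder.
surrogate_out = surrogate.forward(visual_embeddings)
grafted_out = target.forward(visual_embeddings)  # "zero-shot grafting"
```

Because the surrogate and the target share their first layers, an encoder whose outputs suit the surrogate's input is, under these assumptions, also a valid input to the full model without any adapter in between.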
