Zero-Shot Vision Encoder Grafting via LLM Surrogates

Abstract

Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, so the decoder accounts for most of the computational cost of training. To reduce this cost, a promising strategy is to first train the vision encoder with a small language model and then transfer it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be transferred directly to the larger model, a process we call zero-shot grafting. When plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
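The idea of inheriting shallow layers and then grafting can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Decoder`, `make_toy_layer`, `build_surrogate`, `toy_vision_encoder`) is hypothetical, the layers are simple scaling functions standing in for transformer blocks, and the encoder training step on the surrogate is omitted entirely. It only shows that a surrogate built from the target's first layers consumes the same embeddings, so the same encoder can be plugged into either decoder without modification.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


@dataclass
class Decoder:
    """A toy decoder: an ordered stack of layers applied in sequence."""
    layers: List[Layer]

    def __call__(self, embedding: Vector) -> Vector:
        for layer in self.layers:
            embedding = layer(embedding)
        return embedding


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block (scales each value)."""
    def layer(x: Vector) -> Vector:
        return [scale * v for v in x]
    return layer


def build_surrogate(target: Decoder, num_shallow: int) -> Decoder:
    """Reuse the first `num_shallow` layers of the target, unchanged."""
    if not 0 < num_shallow < len(target.layers):
        raise ValueError("surrogate must be a strict, non-empty prefix")
    return Decoder(layers=target.layers[:num_shallow])


def toy_vision_encoder(pixels: List[Vector]) -> Vector:
    """Hypothetical encoder: column-wise mean of the pixel rows."""
    return [sum(col) / len(col) for col in zip(*pixels)]


# Large "target" decoder with four layers; the surrogate inherits the first two.
target = Decoder(layers=[make_toy_layer(s) for s in (2.0, 0.5, 3.0, 1.5)])
surrogate = build_surrogate(target, num_shallow=2)

image = [[1.0, 2.0], [3.0, 4.0]]
embedding = toy_vision_encoder(image)

# In the real setting the encoder would be trained against the surrogate here.
surrogate_out = surrogate(embedding)

# Zero-shot grafting: the same encoder output goes straight into the target.
grafted_out = target(embedding)
```

Because the surrogate holds references to the target's own layer objects rather than copies, both decoders agree on the input embedding space by construction, which is the property the grafting step relies on.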
