Inference Optimal VLMs Need Only One Visual Token but Larger Models

Abstract

Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high inference latency, because the LLM needs substantial compute to process the large number of input tokens (predominantly from the image). To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens; the latter has been the focus of many recent works on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing the visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., 5-10×), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.
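
The trade-off between LLM size and visual token count can be made concrete with a rough cost model. The sketch below is illustrative only: it assumes the common approximation that a decoder-only forward pass costs about 2 FLOPs per parameter per processed token, which is not necessarily the exact accounting used in the paper. The model sizes, the 32 text tokens, and the function names are hypothetical; 576 is the visual token count of a standard LLaVA-1.5 input. It shows how many visual tokens each LLM size can afford under one fixed inference budget.

```python
# Rough inference-cost accounting for a VLM.
# ASSUMPTION: forward-pass cost ~= 2 * (LLM parameters) * (tokens processed).
# This is a common approximation, not the paper's stated formula.

def inference_flops(llm_params: float, visual_tokens: int, text_tokens: int) -> float:
    """Approximate FLOPs for the LLM to process one image plus its text prompt."""
    return 2.0 * llm_params * (visual_tokens + text_tokens)


def max_visual_tokens(budget_flops: float, llm_params: float, text_tokens: int) -> int:
    """Largest visual token count that keeps the LLM within the FLOPs budget."""
    total_tokens = int(budget_flops // (2.0 * llm_params))
    return max(total_tokens - text_tokens, 0)


if __name__ == "__main__":
    TEXT_TOKENS = 32  # hypothetical prompt length
    # Fix the budget to what a hypothetical 0.5B-parameter LLM spends on 576 visual tokens.
    budget = inference_flops(0.5e9, 576, TEXT_TOKENS)

    # Hypothetical LLM sizes; larger models must compress the image harder.
    for params in (0.5e9, 1.8e9, 4e9, 7e9):
        affordable = max_visual_tokens(budget, params, TEXT_TOKENS)
        print(f"{params / 1e9:.1f}B params -> at most {affordable} visual tokens")
```

Under this cost model, every increase in LLM size must be paid for with fewer visual tokens. The scaling laws described in the abstract are what determine which point along this fixed-compute curve yields the lowest downstream error.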
