VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Abstract

Existing approaches to improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: efficient cross-attention between text and image tokens provides general visual context, while a few well-placed, dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network across a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and that it excels on challenging tasks that require detailed visual understanding.
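
To give a rough sense of why sparsifying image-text interaction can save compute without dropping any visual tokens, the sketch below counts query-key score computations per forward pass. It is a back-of-the-envelope illustration, not the paper's implementation: the function names, the layer count, the token counts, and the layer placements are all hypothetical, and the count ignores causal masking, projections, and feed-forward layers.

```python
# Illustrative cost model only. All names and numbers here are assumptions
# made for this sketch; none of them come from the paper.

def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Query-key score computations in one attention layer (no masking)."""
    return num_queries * num_keys


def dense_cost(num_layers: int, n_img: int, n_txt: int) -> int:
    """Baseline: every layer runs self-attention over image + text tokens."""
    n = n_img + n_txt
    return num_layers * attention_pairs(n, n)


def sparse_cost(num_layers: int, n_img: int, n_txt: int,
                cross_layers: set, self_layers: set) -> int:
    """Sparse scheme, assumed for illustration:
    - layers in `self_layers` run full self-attention over image + text;
    - all other layers run text-only self-attention;
    - layers in `cross_layers` add text-to-image cross-attention,
      where text tokens are queries and all image tokens are keys.
    """
    total = 0
    for layer in range(num_layers):
        if layer in self_layers:
            n = n_img + n_txt
            total += attention_pairs(n, n)
        else:
            total += attention_pairs(n_txt, n_txt)
            if layer in cross_layers:
                total += attention_pairs(n_txt, n_img)
    return total


if __name__ == "__main__":
    # Hypothetical setting: 32 layers, 576 image tokens, 64 text tokens.
    layers, n_img, n_txt = 32, 576, 64
    cross = set(range(0, layers, 4))   # cross-attention every 4th layer
    full = {10, 22}                    # two full self-attention layers

    dense = dense_cost(layers, n_img, n_txt)
    sparse = sparse_cost(layers, n_img, n_txt, cross, full)
    print(f"dense:  {dense}")
    print(f"sparse: {sparse}")
    print(f"ratio:  {sparse / dense:.3f}")
```

Under these assumptions, changing the size of `self_layers` moves the cost between the text-only extreme and the dense baseline. That is one way to picture a single network serving several compute budgets, with a per-sample policy choosing how many full self-attention layers to run.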

