Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Abstract

Vision-language models (VLMs) project each image into hundreds to thousands of visual tokens, which makes decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score the visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible removal is fragile because visual-token importance changes with decoder depth: tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, the selected visual tokens pass through the decoder blocks, while the deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, and it preserves the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should be viewed not only as irreversible pruning but also as recoverable routing. Code is available at https://github.com/elmma/mllm-reroute/
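The difference between rank-and-remove and recoverable routing can be illustrated with a toy sketch. This is not the authors' implementation. Tokens are scalars standing in for hidden states, and the names `run_stages`, `score_fn`, `block_fn`, and the `importance` table are hypothetical, introduced only for this example. The sketch does not model text tokens, attention, or the KV cache. It only shows that, with the same number of tokens processed per stage, a deferred token can be selected again later under routing but not under pruning.

```python
from typing import Callable, List, Sequence, Tuple

Token = float  # toy stand-in for a hidden-state vector (assumption)


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k highest scores, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_stages(
    tokens: Sequence[Token],
    stages: Sequence[Tuple[int, int]],            # (num_blocks, keep_k) per stage
    score_fn: Callable[[int, int, Token], float],  # hypothetical ranking rule
    block_fn: Callable[[Token], Token],            # hypothetical decoder block
    recoverable: bool,
) -> Tuple[List[Token], List[List[int]]]:
    states = list(tokens)
    pool = list(range(len(states)))  # indices eligible for selection
    history: List[List[int]] = []
    for stage_id, (num_blocks, keep_k) in enumerate(stages):
        # Rank every candidate currently in the pool.
        scores = [score_fn(stage_id, i, states[i]) for i in pool]
        chosen = [pool[j] for j in top_k_indices(scores, keep_k)]
        # Only the chosen tokens are processed by this stage's blocks.
        for _ in range(num_blocks):
            for i in chosen:
                states[i] = block_fn(states[i])
        history.append(chosen)
        if not recoverable:
            # Rank-and-remove: deferred tokens leave the pool permanently.
            pool = chosen
        # Recoverable routing: the pool is unchanged, so deferred tokens
        # skip this stage with their state untouched and are ranked again.
    return states, history


# Importance flips between stages: tokens 2 and 3 matter only later.
importance = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
score = lambda stage, idx, _state: importance[stage][idx]
step = lambda h: h + 1.0
schedule = [(1, 2), (1, 2)]  # two stages, one block each, keep 2 tokens

rerouted, reroute_hist = run_stages([0.0] * 4, schedule, score, step, True)
pruned, prune_hist = run_stages([0.0] * 4, schedule, score, step, False)
print(reroute_hist)  # [[0, 1], [2, 3]]
print(prune_hist)    # [[0, 1], [0, 1]]
```

Both runs process two tokens per stage, so the per-stage compute in this toy setting is identical. Only the recoverable run can pick up tokens 2 and 3 once their scores rise.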