Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Abstract

Vision-language models (VLMs) project each image into hundreds to thousands of visual tokens, which makes decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth: tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected visual tokens pass through the decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should be viewed not only as irreversible pruning but also as recoverable routing. Code is available at https://github.com/elmma/mllm-reroute/
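The contrast between rank-and-remove and recoverable routing can be illustrated with a minimal toy sketch. It is not the released implementation: the names `run_stages`, `toy_score`, and `identity_block` are hypothetical, the score is a stand-in for an attention-based ranking rule, and KV-cache handling is omitted. The only point it shows is that, with the same per-stage keep budget, deferred tokens stay in the candidate pool under routing but are lost under pruning.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def run_stages(
    tokens: Sequence[Token],
    stages: Sequence[Tuple[int, int]],
    score_fn: Callable[[List[Token], int], List[float]],
    block_fn: Callable[[List[Token]], List[Token]],
    recoverable: bool = True,
) -> Tuple[List[Token], List[List[int]]]:
    """Run a staged schedule over visual tokens (toy illustration).

    stages: one (num_blocks, keep) pair per routing stage (assumed format).
    recoverable=True  -> deferred tokens re-enter the pool at the next stage.
    recoverable=False -> rank-and-remove: only survivors remain candidates.
    Returns the final token states and the indices selected at each stage.
    """
    state = [list(t) for t in tokens]
    pool = list(range(len(state)))  # candidate indices for the next decision
    history: List[List[int]] = []
    for stage, (num_blocks, keep) in enumerate(stages):
        scores = score_fn(state, stage)
        ranked = sorted(pool, key=lambda i: scores[i], reverse=True)
        selected = sorted(ranked[:keep])
        # Only selected tokens are processed by this stage's blocks;
        # deferred tokens bypass the stage with their state unchanged.
        for _ in range(num_blocks):
            updated = block_fn([state[i] for i in selected])
            for i, new in zip(selected, updated):
                state[i] = new
        history.append(selected)
        if not recoverable:
            pool = selected  # removed tokens can never come back
    return state, history


def toy_score(tokens: List[Token], stage: int) -> List[float]:
    """Hypothetical stand-in for attention scores: feature `stage` of each token."""
    return [t[stage] for t in tokens]


def identity_block(tokens: List[Token]) -> List[Token]:
    """Placeholder for a decoder block; leaves token states unchanged."""
    return [list(t) for t in tokens]


if __name__ == "__main__":
    toks = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.95], [0.2, 0.3]]
    schedule = [(2, 2), (2, 2)]  # two stages, two blocks each, keep two tokens
    _, routed = run_stages(toks, schedule, toy_score, identity_block, True)
    _, pruned = run_stages(toks, schedule, toy_score, identity_block, False)
    print(routed)  # [[0, 1], [2, 3]]
    print(pruned)  # [[0, 1], [0, 1]]
```

In this toy run, token 2 scores low at the first stage but highest at the second. Routing selects it at the second stage, whereas pruning has already discarded it, even though both variants process two tokens per stage.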