Reroute, Don't Remove: Recoverable Visual-Token Routing for Vision-Language Models

Abstract

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible step is fragile because visual-token importance changes with decoder depth: tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected visual tokens pass through the decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should be viewed not only as irreversible pruning but also as recoverable routing. Code is available at https://github.com/elmma/mllm-reroute/
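
To make the contrast between rank-and-remove and recoverable routing concrete, the following is a minimal toy sketch in plain Python. It is not the released implementation. The function names (`top_k`, `rank_and_remove`, `recoverable_routing`), the per-stage score table `STAGE_SCORES`, and the per-stage budgets `BUDGETS` are all hypothetical and chosen only for illustration. In particular, the sketch assumes that a score is available for every visual token at every routing stage, and it tracks only which token indices are selected, not their hidden states.

```python
from typing import List, Sequence


def top_k(candidates: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring candidate token indices, in index order."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def rank_and_remove(stage_scores: Sequence[Sequence[float]],
                    budgets: Sequence[int]) -> List[List[int]]:
    """Irreversible baseline: each stage ranks only the survivors of the previous one."""
    candidates = list(range(len(stage_scores[0])))
    history = []
    for scores, k in zip(stage_scores, budgets):
        candidates = top_k(candidates, scores, k)  # the rest are discarded for good
        history.append(candidates)
    return history


def recoverable_routing(stage_scores: Sequence[Sequence[float]],
                        budgets: Sequence[int]) -> List[List[int]]:
    """Recoverable variant: unselected tokens are deferred, not dropped."""
    pool = list(range(len(stage_scores[0])))  # the candidate pool never shrinks
    history = []
    for scores, k in zip(stage_scores, budgets):
        selected = top_k(pool, scores, k)  # only these pass through the stage
        # Deferred tokens (pool minus selected) bypass this stage and are
        # candidates again at the next routing decision.
        history.append(selected)
    return history


# Hypothetical scores for 6 visual tokens over 3 routing stages (assumed data).
STAGE_SCORES = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],
    [0.2, 0.9, 0.1, 0.8, 0.3, 0.4],
    [0.1, 0.5, 0.2, 0.9, 0.3, 0.6],
]
BUDGETS = [3, 2, 2]  # tokens processed per stage; identical for both methods

removed = rank_and_remove(STAGE_SCORES, BUDGETS)
rerouted = recoverable_routing(STAGE_SCORES, BUDGETS)
print("rank-and-remove:    ", removed)
print("recoverable routing:", rerouted)
```

In this toy example both methods process the same number of tokens at every stage, which mirrors the claim that the budget class is preserved. Token 3 has the lowest score at the first stage but the highest at the last one; under rank-and-remove it can never return, whereas under recoverable routing it is selected at the second and third stages.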