Late-Layer Fusion Is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling and apply identical computation to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, which results in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward pass that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With only approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while substantially reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
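
To make the routing schedule concrete, the following minimal sketch counts how many (token, layer) passes a DPVR-LF-style forward performs compared with a fully symmetric forward. Several values are assumptions that the abstract does not state: a 32-layer backbone split into 18 shared layers, 13 text-only layers, and 1 final fusion layer (chosen so that the split is consistent with the thirteen-layer text-only pass and the stabilization after layer 18), 576 image tokens, and 64 text tokens. The names `RoutingPlan` and `token_layer_passes` are hypothetical, and the count ignores the quadratic cost of attention, so it is only a rough proxy for compute.

```python
from dataclasses import dataclass


@dataclass
class RoutingPlan:
    """Hypothetical layer split for a DPVR-LF-style forward pass."""
    shared_layers: int       # early layers: image and text tokens processed together
    text_only_layers: int    # deep layers: image positions are skipped
    side_branch_layers: int  # trainable side branch applied to vision tokens only
    fusion_layers: int       # final layers: both streams are re-fused

    @property
    def backbone_layers(self) -> int:
        # The side branch is not part of the backbone depth.
        return self.shared_layers + self.text_only_layers + self.fusion_layers


def token_layer_passes(plan: RoutingPlan, n_image: int, n_text: int):
    """Return (baseline, routed) counts of (token, layer) passes."""
    n_all = n_image + n_text
    # Symmetric baseline: every token goes through every backbone layer.
    baseline = plan.backbone_layers * n_all
    # Routed forward: vision tokens leave the backbone after the shared
    # layers, pass through the side branch, and rejoin at the fusion layer.
    routed = (
        plan.shared_layers * n_all
        + plan.side_branch_layers * n_image
        + plan.text_only_layers * n_text
        + plan.fusion_layers * n_all
    )
    return baseline, routed


# Assumed values, not taken from the paper: 18 + 13 + 1 = 32 backbone layers.
plan = RoutingPlan(shared_layers=18, text_only_layers=13,
                   side_branch_layers=1, fusion_layers=1)
baseline, routed = token_layer_passes(plan, n_image=576, n_text=64)
print(f"baseline passes: {baseline}, routed passes: {routed}")
print(f"routed / baseline: {routed / baseline:.4f}")
```

Under these assumed numbers, each vision token passes through 20 layers (18 shared, 1 side-branch, 1 fusion) instead of 32, while each text token still traverses the full backbone depth.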