Late-Layer Fusion Is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract

Multimodal large language models (MLLMs) typically inherit the deep, symmetric Transformer backbone designed for unimodal text modeling and apply identical computation to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention drops from 0.68 at layer 0 to 0.07 at layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings point to a mismatch between architectural symmetry and depth-asynchronous modality evolution, which leads to redundant visual computation and may cause perceptual representations to drift during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a trainable one-layer side branch, runs a thirteen-layer text-only forward pass that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With only about 3% trainable parameters, DPVR-LF maintains competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. These results challenge the conventional assumption that vision tokens must traverse every deep language-model layer, and indicate that a single late fusion layer can be sufficient to maintain strong perceptual competence in LLaVA-style MLLMs.
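
The routing described above can be illustrated with a minimal control-flow sketch. It is not the authors' implementation, and every name in it (`dpvr_lf_forward`, `make_toy_layer`, `demo`) is hypothetical. It assumes a 32-layer backbone split as 18 shared layers, 13 text-only layers, and 1 fusion layer, with 576 image tokens and 64 text tokens; the abstract fixes only the 13 text-only layers, the single side-branch layer, and fusion at the final layer. Toy layers stand in for Transformer blocks and simply log how many positions they process. The sketch also assumes image positions are dropped entirely from the text-only pass, which the abstract does not specify in detail.

```python
from typing import Callable, List, Sequence

# A "layer" here is any function mapping a sequence of hidden states to a
# same-length sequence. Scalars stand in for hidden vectors (toy assumption).
Layer = Callable[[List[float]], List[float]]


def dpvr_lf_forward(
    hidden: Sequence[float],
    is_image: Sequence[bool],
    shared_layers: Sequence[Layer],
    side_branch: Layer,
    text_layers: Sequence[Layer],
    fusion_layer: Layer,
) -> List[float]:
    """Hypothetical sketch of the DPVR-LF control flow."""
    # 1) Shared stack: every token passes through the layers up to the
    #    assumed saturation point.
    h = list(hidden)
    for layer in shared_layers:
        h = layer(h)

    # 2) Split by modality at the saturation point.
    img_idx = [i for i, flag in enumerate(is_image) if flag]
    txt_idx = [i for i, flag in enumerate(is_image) if not flag]

    # Vision tokens take the single side-branch layer.
    vision = side_branch([h[i] for i in img_idx])

    # Text tokens continue through the deep stack without image positions.
    text = [h[i] for i in txt_idx]
    for layer in text_layers:
        text = layer(text)

    # 3) Restore the original token order and fuse in one final layer.
    merged = [0.0] * len(h)
    for i, v in zip(img_idx, vision):
        merged[i] = v
    for i, t in zip(txt_idx, text):
        merged[i] = t
    return fusion_layer(merged)


def make_toy_layer(log: List[int]) -> Layer:
    """Toy layer: records how many positions it saw and adds 1 to each state."""
    def layer(seq: List[float]) -> List[float]:
        log.append(len(seq))
        return [x + 1.0 for x in seq]
    return layer


def demo():
    n_img, n_txt = 576, 64  # assumed token counts, for illustration only
    is_image = [True] * n_img + [False] * n_txt
    shared_log, side_log, text_log, fuse_log = [], [], [], []
    out = dpvr_lf_forward(
        [0.0] * (n_img + n_txt),
        is_image,
        [make_toy_layer(shared_log) for _ in range(18)],  # assumed split point
        make_toy_layer(side_log),
        [make_toy_layer(text_log) for _ in range(13)],
        make_toy_layer(fuse_log),
    )
    # Token-layer evaluations: a crude proxy for compute, not a FLOP count.
    routed = sum(shared_log) + sum(side_log) + sum(text_log) + sum(fuse_log)
    baseline = 32 * (n_img + n_txt)
    return out, routed, baseline
```

Under these assumptions, each image token passes through 20 layer applications (18 shared, 1 side branch, 1 fusion) instead of 32, while each text token still passes through all 32. The token-layer count ignores the quadratic cost of attention, so it only indicates where the savings come from and says nothing about measured speedups.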