Late-Layer Fusion Is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and the depth of reasoning they require. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention falls from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, which leads to redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward pass that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% of parameters trainable, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient to maintain strong perceptual competence in LLaVA-style MLLMs.
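As an illustration of the routing schedule described above, the following is a minimal sketch and not the DPVR-LF implementation. It assumes a 32-block backbone split as 18 shared blocks, 13 text-only blocks, and 1 final fusion block; this split is our assumption, chosen only so that the counts add up to 32. `toy_block`, `dpvr_lf_forward`, and the scalar per-block "weights" are hypothetical stand-ins for real Transformer blocks, and the side branch is assumed to process image tokens in isolation.

```python
import math
import random


def toy_block(tokens, scale):
    """Stand-in for one Transformer block: mean-pooled mixing plus a residual update."""
    dim = len(tokens[0])
    context = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + math.tanh(scale * (t[d] + context[d])) for d in range(dim)]
            for t in tokens]


def dpvr_lf_forward(tokens, is_image, scales, side_scale, split_layer, n_text_only):
    """Toy late-layer-fusion schedule (hypothetical helper, not the authors' code).

    Returns the fused token states and the number of block applications
    spent on image positions.
    """
    # Assumed layout: shared blocks + text-only blocks + one final fusion block.
    assert split_layer + n_text_only + 1 == len(scales)
    img_idx = [i for i, flag in enumerate(is_image) if flag]
    txt_idx = [i for i, flag in enumerate(is_image) if not flag]

    # 1) Shared shallow stack: every token passes through every block.
    for s in scales[:split_layer]:
        tokens = toy_block(tokens, s)
    image_visits = split_layer * len(img_idx)

    # 2) Vision path: a single side-branch block applied to image tokens only.
    vision = toy_block([tokens[i] for i in img_idx], side_scale)
    image_visits += len(img_idx)

    # 3) Text path: deep blocks applied to text tokens only (image positions skipped).
    text = [tokens[i] for i in txt_idx]
    for s in scales[split_layer:split_layer + n_text_only]:
        text = toy_block(text, s)

    # 4) Late fusion: scatter both streams back into place and run the final block jointly.
    fused = [None] * len(tokens)
    for i, v in zip(img_idx, vision):
        fused[i] = v
    for i, t in zip(txt_idx, text):
        fused[i] = t
    fused = toy_block(fused, scales[-1])
    image_visits += len(img_idx)
    return fused, image_visits


random.seed(0)
dim, n_layers = 4, 32
is_image = [True] * 6 + [False] * 4          # 6 image tokens, 4 text tokens
tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in is_image]
scales = [0.05] * n_layers                   # toy per-block parameters
out, visits = dpvr_lf_forward(tokens, is_image, scales, side_scale=0.05,
                              split_layer=18, n_text_only=13)
baseline_visits = n_layers * sum(is_image)   # symmetric backbone: all blocks see all image tokens
print(len(out), visits, baseline_visits)
```

Under these assumed settings, each image token passes through 20 blocks (18 shared, 1 side branch, 1 fusion) rather than 32. The sketch only counts block applications on image positions; it says nothing about accuracy, attention patterns, or the actual savings reported for DPVR-LF.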