Late-Layer Fusion Is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention drops from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, which leads to redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward pass that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% of its parameters trainable, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
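The routing described above can be illustrated with a small schematic. The following is only a sketch under stated assumptions, not the implementation behind DPVR-LF: the function `dpvr_lf_forward`, the `ToyLayer` stand-in, the split after 18 shared layers, and the 32-layer total are all hypothetical choices made for illustration (the text only specifies a one-layer side branch, a thirteen-layer text-only pass, and fusion at the final layer). Real Transformer layers would also mix tokens through attention, which the position-wise toy layer does not.

```python
import numpy as np


class ToyLayer:
    """Hypothetical stand-in for a Transformer layer: a position-wise residual map.

    It records how many tokens it processed so the routing can be inspected.
    """

    def __init__(self, d_model, rng):
        self.w = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.tokens_seen = None

    def __call__(self, x):
        self.tokens_seen = x.shape[0]
        return x + np.tanh(x @ self.w)


def dpvr_lf_forward(hidden, is_image, shared_layers, side_branch,
                    text_layers, fusion_layer):
    """Schematic dual-path forward pass (illustrative, not the paper's code).

    hidden:   (seq_len, d_model) token states entering the backbone.
    is_image: (seq_len,) boolean mask, True at image-token positions.
    """
    # 1) Shared shallow stack: image and text tokens are processed together.
    for layer in shared_layers:
        hidden = layer(hidden)

    # 2) Split the sequence at the assumed saturation point.
    vision = hidden[is_image]
    text = hidden[~is_image]

    # 3) Vision path: a single side-branch layer.
    vision = side_branch(vision)

    # 4) Text path: the deep layers only ever see text positions.
    for layer in text_layers:
        text = layer(text)

    # 5) Scatter both streams back into sequence order and fuse once.
    merged = np.empty_like(hidden)
    merged[is_image] = vision
    merged[~is_image] = text
    return fusion_layer(merged)


# Toy configuration: 18 shared + 13 text-only + 1 fusion = 32 layers (assumed).
rng = np.random.default_rng(0)
d_model, n_img, n_txt = 16, 6, 4
is_image = np.array([True] * n_img + [False] * n_txt)
hidden = rng.standard_normal((n_img + n_txt, d_model))

shared = [ToyLayer(d_model, rng) for _ in range(18)]
side = ToyLayer(d_model, rng)
deep = [ToyLayer(d_model, rng) for _ in range(13)]
fusion = ToyLayer(d_model, rng)

out = dpvr_lf_forward(hidden, is_image, shared, side, deep, fusion)
```

In this toy setup the thirteen deep layers process only the 4 text positions, while the side branch processes the 6 image positions once. This is where the reduction in deep-stack visual computation would come from if image tokens dominate the sequence, as they typically do in LLaVA-style inputs.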