Late-Layer Fusion Is Sufficient: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention drops from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings point to a mismatch between architectural symmetry and depth-asynchronous modality evolution, which leads to redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this observation, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a trainable one-layer side branch, runs a thirteen-layer text-only forward pass that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers and indicate that a single late fusion layer can be sufficient to maintain strong perceptual competence in LLaVA-style MLLMs.
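
The routing pattern described for DPVR-LF can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the actual implementation: `toy_layer` is a hypothetical stand-in for a Transformer block, the 32-layer depth and the split index of 18 are assumed values chosen so that the text-only stack has thirteen layers, and "skipping image positions" is modeled here by simply dropping vision tokens from the sequence that the deep stack sees.

```python
import numpy as np

def toy_layer(x, seed):
    """Hypothetical stand-in for a Transformer block:
    pre-norm single-head self-attention with a residual connection."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    xn = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    scores = (xn @ wq) @ (xn @ wk).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ (xn @ wv)

def dpvr_lf_forward(hidden, is_image, num_layers=32, split_layer=18):
    """Sketch of late-layer fusion routing (depth and split index are assumptions).

    hidden:   (seq_len, d) token states; is_image: (seq_len,) boolean mask.
    Assumes the sequence contains at least one image and one text token.
    """
    is_image = np.asarray(is_image, dtype=bool)

    # Shared trunk: all tokens pass through layers 0 .. split_layer - 1.
    for i in range(split_layer):
        hidden = toy_layer(hidden, seed=i)

    vision, text = hidden[is_image], hidden[~is_image]

    # Vision path: a single side-branch layer (the trainable part in the paper).
    vision = toy_layer(vision, seed=1000)

    # Text path: deep stack over text tokens only, layers split_layer .. num_layers - 2.
    for i in range(split_layer, num_layers - 1):
        text = toy_layer(text, seed=i)

    # Re-fuse both streams in their original positions and run the final layer.
    fused = np.empty_like(hidden)
    fused[is_image] = vision
    fused[~is_image] = text
    return toy_layer(fused, seed=num_layers - 1)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 16))
mask = np.array([0, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)
out = dpvr_lf_forward(tokens, mask)
```

In this sketch the deep stack processes only the four text tokens, so its attention cost no longer depends on the number of image tokens; the real model would additionally need to handle positional encodings, causal masking, and key-value caching, none of which are modeled here.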