Stage-Adaptive Token Selection for Efficient Omni-Modal LLMs

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
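
The in-LLM budget allocation described above (from temporal windows to modalities, guided by query relevance scores) can be illustrated with a minimal sketch. The abstract does not specify the allocation rule, so everything below is an assumption made for illustration: the function names `split_budget` and `select_tokens` are hypothetical, the proportional (greedy highest-quotient) split is a stand-in for whatever rule SEATS actually uses, and the per-token relevance scores are taken as given rather than computed from attention.

```python
from typing import Dict, List

# One temporal window: modality name -> per-token query-relevance scores.
Window = Dict[str, List[float]]


def split_budget(weights: List[float], caps: List[int], budget: int) -> List[int]:
    """Split an integer budget roughly proportionally to `weights`.

    Assumption: a greedy highest-quotient rule; entry i never gets more than caps[i].
    """
    budget = min(budget, sum(caps))
    alloc = [0] * len(weights)
    for _ in range(budget):
        # Give the next slot to the entry with the largest weight per slot held.
        best = max(
            (i for i in range(len(weights)) if alloc[i] < caps[i]),
            key=lambda i: weights[i] / (alloc[i] + 1),
        )
        alloc[best] += 1
    return alloc


def select_tokens(windows: List[Window], keep_ratio: float) -> List[Dict[str, List[int]]]:
    """Return, per window and modality, the indices of the tokens to keep."""
    total = sum(len(scores) for w in windows for scores in w.values())
    budget = round(keep_ratio * total)

    # Stage 1: distribute the global budget across temporal windows.
    win_weights = [sum(sum(scores) for scores in w.values()) for w in windows]
    win_caps = [sum(len(scores) for scores in w.values()) for w in windows]
    win_budgets = split_budget(win_weights, win_caps, budget)

    kept: List[Dict[str, List[int]]] = []
    for window, win_budget in zip(windows, win_budgets):
        # Stage 2: distribute each window's budget across its modalities.
        names = list(window)
        mod_budgets = split_budget(
            [sum(window[m]) for m in names],
            [len(window[m]) for m in names],
            win_budget,
        )
        kept_window: Dict[str, List[int]] = {}
        for name, k in zip(names, mod_budgets):
            scores = window[name]
            ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            # Keep the k most relevant tokens, restored to temporal order.
            kept_window[name] = sorted(ranked[:k])
        kept.append(kept_window)
    return kept


# Usage: two windows of interleaved video/audio tokens, keeping half of them.
example = [
    {"video": [0.9, 0.1, 0.8, 0.2], "audio": [0.5, 0.1]},
    {"video": [0.1, 0.1, 0.1, 0.1], "audio": [0.05, 0.05]},
]
kept = select_tokens(example, keep_ratio=0.5)
```

In this toy input the first window carries most of the query relevance, so it receives most of the budget. That is the intended contrast with a fixed per-modality ratio, which would prune every window and modality by the same fraction regardless of the query.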