Stage-Adaptive Token Selection for Efficient Omni-Modal Large Language Models

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
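
The in-LLM budget allocation described above can be illustrated with a minimal sketch. The abstract does not specify how relevance scores are computed or how the budget is split, so everything here is an assumption made for illustration: the names `split_budget` and `allocate_and_select` are hypothetical, the relevance scores are taken as given inputs, and the split uses a simple greedy proportional rule (D'Hondt-style apportionment) rather than whatever rule SEATS actually uses. The sketch first divides a token budget across temporal windows by their total query relevance, then divides each window's share across modalities, and finally keeps the highest-scoring tokens in each group.

```python
from typing import Dict, List, Tuple

# (window index, modality name) -> per-token query relevance scores.
# Hypothetical input format; the paper's actual scoring is not given in the abstract.
Scores = Dict[Tuple[int, str], List[float]]


def split_budget(total: int, weights: List[float], caps: List[int]) -> List[int]:
    """Hand out `total` slots one at a time, roughly in proportion to `weights`.

    Each slot goes to the group with the largest weight / (slots_held + 1)
    (D'Hondt-style apportionment, an assumption of this sketch). A group never
    receives more slots than it has tokens (`caps`).
    """
    alloc = [0] * len(weights)
    for _ in range(min(total, sum(caps))):
        candidates = [i for i in range(len(weights)) if alloc[i] < caps[i]]
        best = max(candidates, key=lambda i: weights[i] / (alloc[i] + 1))
        alloc[best] += 1
    return alloc


def allocate_and_select(scores: Scores, budget: int) -> Dict[Tuple[int, str], List[int]]:
    """Return the indices of retained tokens for every (window, modality) group."""
    windows = sorted({w for w, _ in scores})

    # Level 1: split the global budget across temporal windows.
    win_weight = [sum(sum(s) for (w, _), s in scores.items() if w == win) for win in windows]
    win_cap = [sum(len(s) for (w, _), s in scores.items() if w == win) for win in windows]
    win_budget = split_budget(budget, win_weight, win_cap)

    kept: Dict[Tuple[int, str], List[int]] = {}
    for win, b in zip(windows, win_budget):
        # Level 2: split the window's share across its modalities.
        mods = sorted(m for (w, m) in scores if w == win)
        mod_weight = [sum(scores[(win, m)]) for m in mods]
        mod_cap = [len(scores[(win, m)]) for m in mods]
        mod_budget = split_budget(b, mod_weight, mod_cap)

        for m, k in zip(mods, mod_budget):
            s = scores[(win, m)]
            # Keep the k most relevant tokens, restored to their original order.
            top = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
            kept[(win, m)] = sorted(top)
    return kept


# Toy example: two windows, each with 4 video tokens and 2 audio tokens.
example_scores: Scores = {
    (0, "video"): [0.9, 0.1, 0.8, 0.2],
    (0, "audio"): [0.1, 0.1],
    (1, "video"): [0.1, 0.1, 0.1, 0.1],
    (1, "audio"): [0.7, 0.1],
}
example_kept = allocate_and_select(example_scores, budget=4)
```

In this toy input the first window is dominated by query-relevant video tokens and the second by a single relevant audio token, so the retained budget shifts toward video in one window and audio in the other. A fixed per-modality ratio cannot express that behavior.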