Stage-Adaptive Token Selection for Efficient Omni-Modal Large Language Models

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences that are interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Training-free token selection can reduce this cost, but existing methods either handle visual-only inputs or prune non-textual tokens only before the LLM at fixed per-modality ratios, and so fail to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens block by block and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, once cross-modal fusion is complete, it removes all remaining non-textual tokens. Experiments on Qwen2.5-Omni and Qwen3-Omni show that SEATS improves inference efficiency: when retaining only 10% of visual and audio tokens, it achieves a 9.3× FLOPs reduction and a 4.8× prefill speedup while preserving 96.3% of the original performance.
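The in-LLM budget allocation described above can be illustrated with a minimal sketch. It assumes one plausible reading of "from temporal windows to modalities": the retention budget is first split across windows in proportion to their summed query relevance, then split across modalities within each window, and finally the highest-scoring tokens in each group are kept. The function names `split_budget` and `select_tokens`, the `(window, modality, score)` token layout, and the proportional splitting rule are all hypothetical choices for illustration. The abstract does not specify how SEATS computes relevance scores or divides the budget, so this should not be read as the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical token layout: (window index, modality name, non-negative relevance score).
Token = Tuple[int, str, float]


def split_budget(total: int, weights: List[float], caps: List[int]) -> List[int]:
    """Split an integer budget across groups in proportion to `weights`,
    never giving a group more than its capacity in `caps`."""
    alloc = [0] * len(weights)
    remaining = min(total, sum(caps))  # cannot keep more tokens than exist
    while remaining > 0:
        active = [i for i in range(len(caps)) if alloc[i] < caps[i]]
        wsum = sum(weights[i] for i in active)
        # Proportional share of what is left; uniform if all weights are zero.
        shares = {
            i: remaining * (weights[i] / wsum if wsum > 0 else 1.0 / len(active))
            for i in active
        }
        given = 0
        for i in active:
            g = min(int(shares[i]), caps[i] - alloc[i], remaining - given)
            alloc[i] += g
            given += g
        if given == 0:
            # Every share rounded down to zero: hand one slot to the largest share.
            best = max(active, key=lambda i: shares[i])
            alloc[best] += 1
            given = 1
        remaining -= given
    return alloc


def select_tokens(tokens: List[Token], keep_ratio: float) -> List[int]:
    """Return sorted indices of tokens to keep under a window -> modality budget split."""
    budget = int(round(keep_ratio * len(tokens)))
    groups: Dict[int, Dict[str, List[Tuple[float, int]]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for idx, (window, modality, score) in enumerate(tokens):
        groups[window][modality].append((score, idx))

    windows = sorted(groups)
    w_weights = [sum(s for g in groups[w].values() for s, _ in g) for w in windows]
    w_caps = [sum(len(g) for g in groups[w].values()) for w in windows]
    w_alloc = split_budget(budget, w_weights, w_caps)

    kept: List[int] = []
    for w, w_budget in zip(windows, w_alloc):
        modalities = sorted(groups[w])
        m_weights = [sum(s for s, _ in groups[w][m]) for m in modalities]
        m_caps = [len(groups[w][m]) for m in modalities]
        m_alloc = split_budget(w_budget, m_weights, m_caps)
        for m, k in zip(modalities, m_alloc):
            top = sorted(groups[w][m], reverse=True)[:k]  # highest scores first
            kept.extend(idx for _, idx in top)
    return sorted(kept)


if __name__ == "__main__":
    demo = [
        (0, "video", 0.9), (0, "video", 0.1), (0, "audio", 0.2), (0, "audio", 0.1),
        (1, "video", 0.05), (1, "video", 0.05), (1, "audio", 0.05), (1, "audio", 0.05),
    ]
    print(select_tokens(demo, keep_ratio=0.5))
```

In this toy setup, a window whose tokens are barely relevant to the query can receive no budget at all, which is the intended effect of relevance-driven allocation. A progressive variant would call `select_tokens` again at each later block with a smaller `keep_ratio`, and the late-layer stage would correspond to dropping every remaining non-textual token.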