OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
February 4, 2026
Authors: Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
cs.AI
Abstract
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
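The abstract only sketches the training mechanism, so the snippet below is a minimal, illustrative PyTorch sketch of differentiable token selection with a straight-through estimator under a fixed keep budget. The module name, the linear scoring head, and the 25% keep ratio are assumptions for illustration only, not the authors' OmniSIFT implementation.

```python
import torch
import torch.nn as nn


class StraightThroughTokenSelector(nn.Module):
    """Illustrative sketch: score tokens, keep a hard 0/1 mask in the forward
    pass, and let gradients flow through the soft scores (straight-through trick)."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # lightweight per-token importance head (assumed)
        self.keep_ratio = keep_ratio      # fraction of tokens to retain, e.g. 25%

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        soft = torch.sigmoid(self.scorer(tokens)).squeeze(-1)      # (B, N) soft keep scores
        k = max(1, int(self.keep_ratio * tokens.size(1)))          # token budget
        topk_idx = soft.topk(k, dim=-1).indices
        hard = torch.zeros_like(soft).scatter(-1, topk_idx, 1.0)   # hard 0/1 keep mask
        # Straight-through estimator: forward uses the hard mask,
        # backward sees the gradient of the soft scores.
        mask = hard + soft - soft.detach()
        return tokens * mask.unsqueeze(-1)                          # pruned token sequence


# Hypothetical usage: prune video tokens to 25% of the original context.
selector = StraightThroughTokenSelector(dim=3584, keep_ratio=0.25)
video_tokens = torch.randn(2, 1024, 3584)
pruned = selector(video_tokens)
```

Because the hard top-k mask is non-differentiable, the straight-through surrogate lets the small scoring head (a few million parameters, consistent with the 4.85M figure reported above) be trained end-to-end with the frozen or fine-tuned Omni-LLM backbone.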