
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

February 4, 2026
作者: Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
cs.AI

Abstract

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
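The key training detail in the abstract is that both compression stages are optimized end-to-end through a differentiable straight-through estimator (STE), which makes a hard keep/drop decision over tokens trainable. Below is a minimal, hypothetical PyTorch sketch of that trick; the class name, the linear scoring head, and the fixed top-k selection are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of straight-through top-k token selection,
# the training mechanism the abstract describes. All names and
# shapes are illustrative, not OmniSIFT's actual implementation.
import torch
import torch.nn as nn


class STETokenSelector(nn.Module):
    """Scores tokens, keeps a fraction `keep_ratio` of them, and lets
    gradients flow to the scorer via a straight-through estimator."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # lightweight importance head (assumed)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        scores = torch.sigmoid(self.scorer(tokens)).squeeze(-1)  # (B, L)
        k = max(1, int(self.keep_ratio * tokens.size(1)))
        topk = scores.topk(k, dim=-1).indices
        hard_mask = torch.zeros_like(scores).scatter(-1, topk, 1.0)
        # Straight-through estimator: the forward pass uses the hard
        # 0/1 mask, while the backward pass takes the gradient of the
        # soft scores (hard_mask contributes no gradient).
        mask = hard_mask + scores - scores.detach()
        return tokens * mask.unsqueeze(-1)


# Usage: prune a 1000-token video stream to ~25% of its length,
# matching the 25% token budget reported in the abstract.
x = torch.randn(2, 1000, 768)
selector = STETokenSelector(dim=768, keep_ratio=0.25)
compressed = selector(x)  # same shape; pruned tokens are zeroed out
```

For clarity, this sketch zeroes out pruned tokens in place so shapes stay fixed during training; an actual compression pipeline would gather only the surviving tokens to shorten the sequence, which is where the latency savings come from.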