OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
November 18, 2025
Authors: Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
cs.AI
Abstract
Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token compression framework that optimizes multimodal token representations and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, dynamically guiding video token pruning while preserving cues from audio anchors enhanced by cross-modal similarity. Within each time window, OmniZip compresses video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves a 3.42x inference speedup and a 1.4x memory reduction over other top-performing counterparts while maintaining performance, with no training required.
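To make the described pipeline concrete, the following is a minimal sketch of audio-guided video token pruning: per-time-group audio scores set how many video tokens each group keeps, and cross-modal similarity to an audio anchor selects which tokens survive. Everything here is an assumption for illustration only, not the paper's implementation: the helper names (audio_retention_scores, prune_video_tokens), the saliency heuristic, the budget rule, and the base_keep_ratio parameter are invented, and the interleaved spatio-temporal compression step is omitted.

# Illustrative sketch only; all heuristics and names below are assumptions,
# not OmniZip's actual algorithm.
import torch
import torch.nn.functional as F


def audio_retention_scores(audio_tokens: torch.Tensor) -> torch.Tensor:
    """Score each time group's audio information density.

    audio_tokens: (T, Na, D) -- T time groups, Na audio tokens each, D channels.
    Returns: (T,) scores that sum to 1 (softmax over groups; an assumed choice).
    """
    # Assumed saliency proxy: mean intra-group token similarity.
    norm = F.normalize(audio_tokens, dim=-1)            # (T, Na, D)
    sim = torch.einsum("tnd,tmd->tnm", norm, norm)      # (T, Na, Na)
    density = sim.mean(dim=(1, 2))                      # (T,)
    return torch.softmax(density, dim=0)


def prune_video_tokens(video_tokens, audio_tokens, base_keep_ratio=0.3):
    """Keep more video tokens in time groups whose audio is information-dense.

    video_tokens: (T, Nv, D); audio_tokens: (T, Na, D).
    Returns a list of T tensors holding the retained video tokens per group.
    """
    scores = audio_retention_scores(audio_tokens)        # (T,)
    T, Nv, _ = video_tokens.shape
    # Per-group budget: base ratio scaled by the normalized audio score.
    budgets = (base_keep_ratio * Nv * scores * T).clamp(min=1, max=Nv).long()

    # One audio anchor per time group (mean-pooled audio tokens; an assumption).
    audio_anchor = F.normalize(audio_tokens.mean(dim=1), dim=-1)   # (T, D)
    v_norm = F.normalize(video_tokens, dim=-1)                     # (T, Nv, D)

    kept = []
    for t in range(T):
        # Rank video tokens by cross-modal similarity to the group's audio anchor.
        cross_sim = v_norm[t] @ audio_anchor[t]                    # (Nv,)
        idx = cross_sim.topk(int(budgets[t])).indices
        kept.append(video_tokens[t, idx])
    return kept


if __name__ == "__main__":
    video = torch.randn(4, 196, 64)   # 4 time groups, 196 video tokens, dim 64
    audio = torch.randn(4, 25, 64)    # 4 time groups, 25 audio tokens
    pruned = prune_video_tokens(video, audio)
    print([p.shape[0] for p in pruned])  # tokens kept per group varies with audio density

The key design point the sketch tries to convey is that the token budget is dynamic per time group (driven by audio information density) rather than a fixed ratio applied uniformly across the video.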