Small Vision-Language Models are Smart Compressors for Long Video Understanding
April 9, 2026
Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
cs.AI
Abstract
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments while wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors that maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance under aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
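To make the allocation idea concrete, the following is a minimal, hypothetical sketch of query-aware budget routing in the spirit of ATA. The function name, the per-frame floor/cap values, and the proportional allocation rule are illustrative assumptions, not the paper's actual implementation; the paper only specifies that each segment receives between 0.5 and 16 tokens per frame under a fixed total budget, with low-relevance segments reduced to minimal temporal anchors.

```python
# Hypothetical sketch of adaptive token allocation (ATA-style routing).
# Assumptions: `relevance` holds zero-shot query-relevance scores per
# segment (e.g., from a small VLM's prior); allocation is proportional
# to relevance between a per-frame floor and cap. Illustrative only.

def allocate_tokens(relevance, n_frames_per_seg, budget=8192,
                    min_per_frame=0.5, max_per_frame=16):
    """Split a fixed visual-token budget across video segments.

    Returns an integer token allocation per segment whose sum never
    exceeds `budget`.
    """
    # Every segment first gets a minimal "temporal anchor" allocation
    # so the global storyline survives aggressive compression.
    floors = [max(1, int(min_per_frame * n)) for n in n_frames_per_seg]
    caps = [int(max_per_frame * n) for n in n_frames_per_seg]
    alloc = floors[:]
    remaining = budget - sum(alloc)
    if remaining <= 0:
        return alloc  # budget too tight: anchors only

    # Distribute the remaining budget proportionally to relevance,
    # clipping each segment at its per-frame cap.
    total_rel = sum(relevance) or 1.0
    for i, r in enumerate(relevance):
        extra = int(remaining * r / total_rel)
        alloc[i] = min(caps[i], alloc[i] + extra)
    return alloc
```

Under this sketch, a highly relevant segment receives dense bandwidth (up to 16 tokens/frame) while background segments stay near the 0.5 tokens/frame anchor floor, e.g. `allocate_tokens([0.9, 0.1, 0.05], [64, 64, 64], budget=1024)`.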