

Small Vision-Language Models are Smart Compressors for Long Video Understanding

April 9, 2026
作者: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
cs.AI

Abstract

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity: they discard decisive moments while wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient, query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router: it allocates dense bandwidth to query-critical segments while compressing redundant content into minimal temporal anchors that maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance under aggressive dynamic compression (0.5–16 tokens/frame). On the extreme-long-video benchmark LVBench (4101 s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling the input to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
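The budget-splitting idea behind ATA can be illustrated with a minimal sketch: given per-segment relevance scores (which the paper obtains from the SVLM's zero-shot prior), a fixed visual-token budget is routed densely to query-critical segments while every other segment keeps only a minimal temporal anchor. The function and parameter names below are illustrative assumptions, not the paper's actual API.

```python
def allocate_tokens(relevance, budget, min_anchor=1, max_per_segment=16):
    """Split `budget` tokens across video segments: every segment keeps
    at least `min_anchor` tokens (a temporal anchor that preserves the
    global storyline); the remainder goes to high-relevance segments,
    capped at `max_per_segment` tokens each."""
    n = len(relevance)
    alloc = [min_anchor] * n
    leftover = budget - min_anchor * n
    if leftover <= 0:
        return alloc  # budget too tight: anchors only
    total = sum(relevance) or 1.0
    # Desired extra tokens per segment, proportional to relevance.
    shares = [leftover * r / total for r in relevance]
    # Hand out extras from most to least relevant segment.
    for i in sorted(range(n), key=lambda j: relevance[j], reverse=True):
        give = min(int(shares[i]), max_per_segment - alloc[i], leftover)
        alloc[i] += give
        leftover -= give
    return alloc

# A query-critical middle segment gets dense bandwidth; the two
# low-relevance segments collapse to single-token anchors.
print(allocate_tokens([0.1, 0.8, 0.1], budget=12))  # [1, 8, 1]
```

This captures only the allocation arithmetic, not the O(1) routing or the SVLM scoring itself; in the paper the relevance prior comes for free from the compressor's forward pass.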