

Small Vision-Language Models are Smart Compressors for Long Video Understanding

April 9, 2026
作者: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
cs.AI

Abstract

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity: they discard decisive moments while wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient, query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router: it allocates dense bandwidth to query-critical segments while compressing redundant content into minimal temporal anchors that maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance under aggressive dynamic compression (0.5–16 tokens/frame). On the extreme-long-video benchmark LVBench (4101 s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling the input to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
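The budget-splitting idea behind ATA can be illustrated with a minimal sketch: given per-segment relevance scores (which the paper obtains from the SVLM's zero-shot prior), a fixed visual-token budget is routed densely to query-critical segments while every other segment keeps only a minimal temporal anchor. The function and parameter names below are illustrative assumptions, not the paper's actual API.

```python
def allocate_tokens(relevance, budget, min_anchor=1, max_per_segment=16):
    """Split `budget` tokens across video segments: every segment keeps
    at least `min_anchor` tokens (a temporal anchor that preserves the
    global storyline); the remainder goes to high-relevance segments,
    capped at `max_per_segment` tokens each."""
    n = len(relevance)
    alloc = [min_anchor] * n
    leftover = budget - min_anchor * n
    if leftover <= 0:
        return alloc  # budget too tight: anchors only
    total = sum(relevance) or 1.0
    # Desired extra tokens per segment, proportional to relevance.
    shares = [leftover * r / total for r in relevance]
    # Hand out extras from most to least relevant segment.
    for i in sorted(range(n), key=lambda j: relevance[j], reverse=True):
        give = min(int(shares[i]), max_per_segment - alloc[i], leftover)
        alloc[i] += give
        leftover -= give
    return alloc

# A query-critical middle segment gets dense bandwidth; the two
# low-relevance segments collapse to single-token anchors.
print(allocate_tokens([0.1, 0.8, 0.1], budget=12))  # [1, 8, 1]
```

This captures only the allocation arithmetic, not the O(1) routing or the SVLM scoring itself; in the paper the relevance prior comes for free from the compressor's forward pass.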