Small Vision-Language Models are Smart Compressors for Long Video Understanding
April 9, 2026
Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
cs.AI
Abstract
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments while wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors that maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance under aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
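To make the allocation idea concrete, the following is a minimal, hypothetical sketch of query-aware budget routing in the spirit of ATA. The function name, the per-frame floor/cap values, and the proportional allocation rule are illustrative assumptions, not the paper's actual implementation; the paper only specifies that each segment receives between 0.5 and 16 tokens per frame under a fixed total budget, with low-relevance segments reduced to minimal temporal anchors.

```python
# Hypothetical sketch of adaptive token allocation (ATA-style routing).
# Assumptions: `relevance` holds zero-shot query-relevance scores per
# segment (e.g., from a small VLM's prior); allocation is proportional
# to relevance between a per-frame floor and cap. Illustrative only.

def allocate_tokens(relevance, n_frames_per_seg, budget=8192,
                    min_per_frame=0.5, max_per_frame=16):
    """Split a fixed visual-token budget across video segments.

    Returns an integer token allocation per segment whose sum never
    exceeds `budget`.
    """
    # Every segment first gets a minimal "temporal anchor" allocation
    # so the global storyline survives aggressive compression.
    floors = [max(1, int(min_per_frame * n)) for n in n_frames_per_seg]
    caps = [int(max_per_frame * n) for n in n_frames_per_seg]
    alloc = floors[:]
    remaining = budget - sum(alloc)
    if remaining <= 0:
        return alloc  # budget too tight: anchors only

    # Distribute the remaining budget proportionally to relevance,
    # clipping each segment at its per-frame cap.
    total_rel = sum(relevance) or 1.0
    for i, r in enumerate(relevance):
        extra = int(remaining * r / total_rel)
        alloc[i] = min(caps[i], alloc[i] + extra)
    return alloc
```

Under this sketch, a highly relevant segment receives dense bandwidth (up to 16 tokens/frame) while background segments stay near the 0.5 tokens/frame anchor floor, e.g. `allocate_tokens([0.9, 0.1, 0.05], [64, 64, 64], budget=1024)`.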