CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
February 13, 2026
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
cs.AI
Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for every frame incurs substantial computational overhead. To address these limitations, we propose leveraging video codec primitives (specifically, motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image-encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec-primitive densities, we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
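To make the core idea more concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the authors' implementation) of a lightweight transformer encoder that aggregates per-frame motion-vector/residual patches and is pre-trained to match a frozen image encoder's embeddings via a cosine-alignment objective. The class and function names (CodecPrimitiveEncoder, alignment_loss), token layout, and dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodecPrimitiveEncoder(nn.Module):
    """Hypothetical lightweight encoder: aggregates motion-vector / residual
    patches of a non-keyframe into a single token in the image-encoder space."""

    def __init__(self, in_dim: int, d_model: int = 256, out_dim: int = 768,
                 n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, d_model)  # patchified primitives -> model dim
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, out_dim)  # match image-encoder width

    def forward(self, primitives: torch.Tensor) -> torch.Tensor:
        # primitives: (batch, num_patches, in_dim), e.g. MV displacements + residual stats
        x = self.encoder(self.proj_in(primitives))
        return self.proj_out(x.mean(dim=1))  # one compact token per non-keyframe


def alignment_loss(codec_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Pre-training objective: pull codec-primitive embeddings toward the (frozen)
    image-encoder embeddings of the same frames (cosine distance)."""
    return 1.0 - F.cosine_similarity(codec_emb, image_emb, dim=-1).mean()


# Usage sketch: 8 frames, 196 primitive patches, 6 features per patch (assumed layout)
enc = CodecPrimitiveEncoder(in_dim=6)
prims = torch.randn(8, 196, 6)
targets = torch.randn(8, 768)  # stand-in for frozen image-encoder features
loss = alignment_loss(enc(prims), targets)
loss.backward()
```

The point of the alignment stage, as described in the abstract, is that codec-primitive tokens already live near the image-encoder embedding space before end-to-end fine-tuning, which speeds convergence; non-keyframes then contribute a handful of cheap tokens instead of a full grid of image tokens, which is where the reported token and time-to-first-token savings come from.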