CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
February 13, 2026
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
cs.AI
Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for every frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec-primitive densities, we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
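To illustrate the idea described in the abstract, below is a minimal sketch (not the authors' released code) of a lightweight transformer-based encoder that aggregates codec primitives (motion vectors and residuals) for non-keyframes and is pre-trained to align its output tokens with a frozen image encoder's embeddings. All module and parameter names (`CodecPrimitiveEncoder`, `alignment_loss`, token counts, dimensions) are illustrative assumptions, and the cosine-alignment objective is one plausible choice for the alignment pre-training, not necessarily the one used in the paper.

```python
# Sketch of a codec-primitive encoder: patchify motion vectors and residuals,
# summarize them with learnable query tokens via a small transformer, and
# project into the image encoder's embedding space for alignment pre-training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodecPrimitiveEncoder(nn.Module):
    """Aggregates per-frame motion-vector and residual maps into a small set of
    tokens compatible with the VideoLM's visual token space (hypothetical design)."""

    def __init__(self, mv_dim=2, res_dim=3, patch=16, hidden_dim=512,
                 out_dim=1024, num_tokens=16, depth=4, heads=8):
        super().__init__()
        # Patch embeddings: motion vectors have 2 channels (dx, dy), residuals 3 (RGB).
        self.mv_proj = nn.Conv2d(mv_dim, hidden_dim, kernel_size=patch, stride=patch)
        self.res_proj = nn.Conv2d(res_dim, hidden_dim, kernel_size=patch, stride=patch)
        # Learnable queries that summarize a frame's codec primitives into few tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(hidden_dim, heads,
                                           dim_feedforward=4 * hidden_dim,
                                           batch_first=True, norm_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # Project into the image-encoder embedding dimension for alignment.
        self.to_visual = nn.Linear(hidden_dim, out_dim)

    def forward(self, motion_vectors, residuals):
        # motion_vectors: (B, 2, H, W), residuals: (B, 3, H, W)
        mv_tokens = self.mv_proj(motion_vectors).flatten(2).transpose(1, 2)   # (B, N, D)
        res_tokens = self.res_proj(residuals).flatten(2).transpose(1, 2)      # (B, N, D)
        memory = torch.cat([mv_tokens, res_tokens], dim=1)                    # (B, 2N, D)
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)          # (B, T, D)
        tokens = self.decoder(q, memory)                                      # (B, T, D)
        return self.to_visual(tokens)                                         # (B, T, out_dim)


def alignment_loss(codec_tokens, image_tokens):
    """Assumed pre-training objective: pull pooled codec-primitive tokens toward the
    frozen image encoder's embedding of the same frame via cosine distance."""
    codec_vec = F.normalize(codec_tokens.mean(dim=1), dim=-1)
    image_vec = F.normalize(image_tokens.mean(dim=1), dim=-1)
    return (1.0 - (codec_vec * image_vec).sum(dim=-1)).mean()


if __name__ == "__main__":
    enc = CodecPrimitiveEncoder()
    mv = torch.randn(2, 2, 224, 224)    # decoded motion vectors for 2 non-keyframes
    res = torch.randn(2, 3, 224, 224)   # decoded residuals for the same frames
    img = torch.randn(2, 256, 1024)     # frozen image-encoder tokens (alignment target)
    out = enc(mv, res)
    print(out.shape, alignment_loss(out, img).item())
```

In this sketch, only keyframes would pass through the full image encoder; the cheaper codec-primitive tokens stand in for the remaining frames, which is how the token-count and time-to-first-token savings reported above would arise.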