AdaCodec: Predictive Visual Coding for Video Multimodal Large Language Models

Abstract

Video is temporally redundant: adjacent frames usually share most of their objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) typically encode each sampled frame as an independent RGB image, so the visual tokens repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of the inter-frame changes. We call this interface a predictive visual code and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes the inter-frame changes, including motion and prediction residuals, as compact P-tokens. On all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 of the budget, AdaCodec with 32k tokens surpasses the 224k-token baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while cutting time-to-first-token from 9.26 s to 1.62 s, a reduction of roughly 5.7x.
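
To make the allocation rule in the abstract concrete, the following is a minimal sketch and not AdaCodec's actual implementation. It assumes frames are flattened lists of pixel intensities, uses mean absolute difference against the most recent reference frame as a crude stand-in for the conditional predictive cost, and uses made-up values for the threshold and for the per-frame token counts (`ref_tokens`, `p_tokens`). All function names and parameters here are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # hypothetical: a flattened grayscale frame


def prediction_cost(reference: Frame, current: Frame) -> float:
    """Mean absolute difference; an assumed proxy for conditional predictive cost."""
    return sum(abs(a - b) for a, b in zip(reference, current)) / len(current)


def allocate_tokens(
    frames: List[Frame],
    threshold: float = 0.25,  # assumed cutoff, not from the paper
    ref_tokens: int = 256,    # assumed cost of a full reference frame
    p_tokens: int = 16,       # assumed cost of a compact P-token frame
) -> List[Tuple[str, int]]:
    """Label each frame as a full reference ("ref") or a change-only frame ("P")."""
    plan: List[Tuple[str, int]] = []
    reference = None
    for frame in frames:
        # Spend full tokens only when the frame is poorly predicted
        # from the last reference (or when there is no reference yet).
        if reference is None or prediction_cost(reference, frame) > threshold:
            plan.append(("ref", ref_tokens))
            reference = frame
        else:
            plan.append(("P", p_tokens))
    return plan


# Toy clip: two near-static frames, a scene cut, then another near-static frame.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9, 1.0],
]
plan = allocate_tokens(frames)
total = sum(tokens for _, tokens in plan)
per_frame_rgb_total = 256 * len(frames)
```

In this toy clip only the first frame and the frame after the scene cut receive full reference tokens, so the sketch spends 544 tokens where per-frame RGB encoding at the same assumed rate would spend 1024. The real system encodes motion and prediction residuals into P-tokens rather than merely counting them; this sketch illustrates only the budget decision.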