AdaCodec: Predictive Visual Coding for Video Multimodal Large Language Models

Abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
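
The allocation rule described above can be illustrated with a minimal sketch. It is not the AdaCodec implementation: the abstract does not define how conditional predictive cost is computed, so the sketch assumes a stand-in cost (mean absolute difference from the most recent reference frame). The names `predictive_cost` and `allocate_tokens`, the threshold, and the per-frame token counts are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def predictive_cost(frame: Sequence[float], reference: Sequence[float]) -> float:
    """Stand-in cost (assumption): mean absolute difference between a frame
    and the most recent reference frame. The paper's actual conditional
    predictive cost is not specified in the abstract."""
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)


def allocate_tokens(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.2,   # hypothetical cutoff for "poorly predicted"
    full_tokens: int = 256,   # hypothetical token cost of a reference frame
    p_tokens: int = 16,       # hypothetical token cost of a P-token frame
) -> List[Tuple[str, int]]:
    """Label each frame as a full reference frame or a compact P frame."""
    plan: List[Tuple[str, int]] = []
    reference = None
    for frame in frames:
        # The first frame has no prior context, so it is always a reference.
        if reference is None or predictive_cost(frame, reference) > threshold:
            plan.append(("reference", full_tokens))
            reference = frame
        else:
            # Well predicted from context: spend only the compact budget.
            plan.append(("P", p_tokens))
    return plan


# Toy "video": five frames of four pixels each, with a scene cut at frame 3.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.9, 1.0, 1.0],
]
plan = allocate_tokens(frames)
adaptive_total = sum(tokens for _, tokens in plan)
per_frame_total = 256 * len(frames)  # every frame encoded as a full image
```

In this toy setting only the first frame and the frame after the scene cut receive the full budget, so the adaptive plan uses fewer tokens than encoding all five frames independently. The real system would encode motion and prediction residuals into the P-tokens rather than merely counting them.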