AdaCodec: A Predictive Visual Code for Video MLLMs

Abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, so their visual tokens repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at one-seventh of the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while cutting time-to-first-token from 9.26 s to 1.62 s.
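
The allocation rule described above (full tokens for a reference frame when the predictive cost is high, compact P-tokens otherwise) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the mean absolute difference to the previous frame stands in for the conditional predictive cost, and the names `plan_tokens`, `threshold`, `ref_tokens`, and `p_tokens`, as well as the token counts, are hypothetical and not taken from AdaCodec.

```python
from typing import List, Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in predictive cost (an assumption): mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_tokens(
    frames: Sequence[Sequence[float]],
    threshold: float,
    ref_tokens: int,
    p_tokens: int,
) -> List[Tuple[str, int]]:
    """Label each frame "ref" (full token budget) or "p" (compact budget)."""
    plan: List[Tuple[str, int]] = []
    previous = None
    for frame in frames:
        # The first frame has no prior context, so it is always a reference frame.
        if previous is None or mean_abs_diff(frame, previous) > threshold:
            plan.append(("ref", ref_tokens))
        else:
            plan.append(("p", p_tokens))
        previous = frame
    return plan


# Toy 4-pixel "frames": a near-static pair, a scene cut, then another near-static frame.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.9],
]
plan = plan_tokens(frames, threshold=0.2, ref_tokens=256, p_tokens=16)
kinds = [kind for kind, _ in plan]
adaptive_total = sum(tokens for _, tokens in plan)
per_frame_total = 256 * len(frames)  # every frame encoded as a full RGB image
```

In this toy setting only the first frame and the frame after the scene cut receive the full budget, so the adaptive plan uses fewer tokens than per-frame encoding. The actual cost measure and P-token encoder used by AdaCodec are not specified in this abstract.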