AdaCodec: Predictive Visual Codes for Video MLLMs

Abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
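
The allocation rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. The cost measure (mean absolute difference from the previous frame), the threshold, and the per-frame token counts are all assumptions chosen for illustration, and the names `FramePlan`, `prediction_cost`, and `plan_tokens` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FramePlan:
    index: int
    kind: str    # "ref" = full reference frame, "P" = change-only frame
    tokens: int  # visual tokens budgeted for this frame


def prediction_cost(prev: Sequence[float], cur: Sequence[float]) -> float:
    """Mean absolute difference between two frames.

    Assumption: a crude stand-in for the conditional predictive cost,
    which the abstract does not define.
    """
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def plan_tokens(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.2,   # assumed value, not from the paper
    ref_tokens: int = 256,    # assumed tokens per reference frame
    p_tokens: int = 16,       # assumed tokens per P-frame
) -> List[FramePlan]:
    """Give full tokens only to frames that are hard to predict."""
    plan: List[FramePlan] = []
    for i, frame in enumerate(frames):
        # The first frame has no prior context, so it is always a reference.
        if i == 0 or prediction_cost(frames[i - 1], frame) > threshold:
            plan.append(FramePlan(i, "ref", ref_tokens))
        else:
            plan.append(FramePlan(i, "P", p_tokens))
    return plan


# Toy video: two near-static shots separated by a scene cut at frame 2.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9, 1.0],
]
plan = plan_tokens(frames)
print([p.kind for p in plan])         # ['ref', 'P', 'ref', 'P']
print(sum(p.tokens for p in plan))    # 544, versus 4 * 256 = 1024 per-frame
```

In this toy case only the first frame and the scene cut receive full tokens, so the total drops from 1024 to 544 tokens. A real system would encode motion and prediction residuals into the P-tokens rather than only counting them.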