LiteFrame: 効率的な視覚エンコーダがVideo LLMにおけるフレームスケーリングを可能にする

要旨

長編動画に対応する大規模動画言語モデル（Video LLM）のスケーリングにおける根本的な課題は、ビジュアルトークンのコンテキスト長の爆発的な増加を管理することにある。既存のアプローチの多くは、特徴抽出後にビジュアルトークンを削減し、LLMの計算負荷を軽減する「事後的な」トークン削減に重点を置いている。これらの手法はビジュアルトークンの数を効果的に削減するものの、主要なレイテンシボトルネックがLLMから、視覚エンコーダによる高コストなフレーム単位の処理へと移行するという問題が生じる。この課題に対処するため、我々はVideo LLM向けの強力かつ高効率なビデオエンコーダ基盤であるLiteFrameを提案する。LiteFrameを学習するために、我々は「圧縮トークン蒸留（Compressed Token Distillation、CTD）」という新たな学習フレームワークを導入する。これは、コンパクトな学生視覚エンコーダが、大規模な教師視覚モデルによって生成された情報密度が高く時空間的に圧縮された表現を直接予測するように学習させ、冗長な計算を効果的に回避する手法である。さらに言語モデル適応（LMA）と組み合わせることで、新たなレイテンシと精度のパレート最適フロンティアが実現される。InternVL3-8Bと比較して、LiteFrameは8倍のフレームを処理しながらエンドツーエンドのレイテンシを35%削減し、複数のベンチマークにおいて動画理解の平均精度を向上させる。本結果は、固定された計算リソースの下で、より長尺の動画理解を実現する新たな可能性を示すものである。

English

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8times more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.