Small Vision-Language Models are Smart Compressors for Long Video Understanding

Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
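To make the idea of budgeted, relevance-driven token routing concrete, here is a minimal Python sketch. It is not Tempo's ATA implementation, which the abstract does not specify. The function name `route_tokens`, the assumption that each segment arrives with a relevance score in [0, 1], the linear score-to-tokens mapping, and the `min_tokens`/`max_tokens` defaults are all hypothetical choices made for illustration. The sketch only shows the general pattern: each segment gets a constant-time decision based on its own score, a small floor of tokens is reserved for every later segment (a stand-in for the "minimal temporal anchors"), and the total never exceeds the budget.

```python
from typing import List, Sequence


def route_tokens(
    relevance: Sequence[float],
    budget: int,
    min_tokens: int = 4,    # hypothetical per-segment floor ("anchor")
    max_tokens: int = 128,  # hypothetical per-segment ceiling
) -> List[int]:
    """Assign a token count to each segment, in order, under a hard budget.

    Assumption: relevance[i] is a score in [0, 1] for segment i. How such
    scores are obtained is outside the scope of this sketch.
    """
    n = len(relevance)
    if budget < n * min_tokens:
        raise ValueError("budget cannot cover the minimum anchor for every segment")

    remaining = budget
    allocation: List[int] = []
    for i, raw_score in enumerate(relevance):
        # Clamp the score so out-of-range inputs cannot break the bounds.
        score = min(max(raw_score, 0.0), 1.0)
        # Linear map from score to a desired token count (illustrative choice).
        wanted = min_tokens + int(score * (max_tokens - min_tokens))
        # Keep enough budget aside so every later segment still gets its floor.
        reserve = min_tokens * (n - i - 1)
        tokens = min(wanted, remaining - reserve)
        allocation.append(tokens)
        remaining -= tokens
    return allocation


if __name__ == "__main__":
    scores = [0.0, 1.0, 0.0, 0.5]
    print(route_tokens(scores, budget=64, min_tokens=4, max_tokens=32))
```

Because each decision looks only at the current score and the running remainder, the per-segment cost is constant and no later segment influences an earlier one. The trade-off of this greedy form is that early high-scoring segments can exhaust the budget, leaving later segments at the floor.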
