VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome these challenges, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is lossless, navigable, scalable, and free of captioning and preprocessing. An overview of the video is available at a glance, and any region can be recursively zoomed into. The same visual representation is used uniformly for the video, intermediate investigations, and the agent's memory, which eliminates lossy text conversion end-to-end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Formulating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture in which a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
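The logarithmic access depth claimed above can be illustrated with a minimal sketch. It is not the authors' implementation: the grid size (64 cells per level), the 1-second leaf span, and the function names `zoom_depth` and `cell_path` are all assumptions chosen for illustration. The sketch treats the video as a timeline that is split evenly into cells at every zoom level.

```python
def zoom_depth(duration_s, cells_per_grid=64, leaf_span_s=1.0):
    """Count zoom levels until one grid cell spans at most leaf_span_s seconds.

    cells_per_grid and leaf_span_s are illustrative assumptions,
    not values taken from the paper.
    """
    depth = 0
    span = float(duration_s)
    while span > leaf_span_s:
        span /= cells_per_grid  # each zoom splits the current span evenly
        depth += 1
    return depth


def cell_path(t, duration_s, depth, cells_per_grid=64):
    """Return the cell index picked at each level when zooming toward time t."""
    start, span = 0.0, float(duration_s)
    path = []
    for _ in range(depth):
        cell = span / cells_per_grid
        # Clamp so that t == duration_s still falls into the last cell.
        idx = min(int((t - start) // cell), cells_per_grid - 1)
        path.append(idx)
        start += idx * cell
        span = cell
    return path


if __name__ == "__main__":
    for hours in (1, 10):
        print(hours, "h ->", zoom_depth(hours * 3600), "levels")
    print(cell_path(100.0, 3600, zoom_depth(3600)))
```

Under these assumptions a 1-hour video needs 2 zoom levels and a 10-hour video needs 3, so a tenfold increase in duration adds a single level. Capping `depth` at a smaller value corresponds to the kind of exploration-depth budget the abstract describes as a compute-accuracy hyperparameter.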
