Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm: they process frames uniformly regardless of query difficulty, so computational cost grows with video duration. Interactive frameworks have emerged, but they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning, which bootstraps native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling: performance improves as the number of reasoning turns increases, which validates the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) show that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the roughly 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
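The Observation-Thought-Action cycle described above can be illustrated with a minimal sketch. This is not the OmniAgent implementation: the names `Memory`, `run_episode`, `toy_policy`, and `toy_perceive`, the action format, and the 60-second window are all hypothetical choices made for illustration. The sketch only shows the general idea that the policy conditions on a compact textual memory rather than on the raw video, so the context it reads grows with the number of turns taken and not with the video's duration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Persistent textual memory: one short note per perception action."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)

    def render(self) -> str:
        return "\n".join(self.notes)


# Hypothetical action format (an assumption, not taken from the paper):
#   ("inspect", start_sec, end_sec)  -> look at one segment on demand
#   ("answer", text)                 -> stop and answer the question
Action = Tuple
Policy = Callable[[str, str], Action]      # (question, memory text) -> action
Perceiver = Callable[[float, float], str]  # (start, end) -> textual cue


def run_episode(question: str, policy: Policy, perceive: Perceiver,
                max_turns: int = 8) -> Tuple[Optional[str], Memory]:
    """Iterate think -> act -> observe until the policy answers or turns run out."""
    memory = Memory()
    for _ in range(max_turns):
        # Thought/Action: the policy sees only the question and the memory text.
        action = policy(question, memory.render())
        if action[0] == "answer":
            return action[1], memory
        # Observation: perceive only the requested segment, store it as text.
        _, start, end = action
        observation = perceive(start, end)
        memory.add(f"[{start:.0f}s-{end:.0f}s] {observation}")
    return None, memory  # turn budget exhausted without an answer


# --- Toy stand-ins for the model and the video (purely illustrative) ---
CAPTIONS = {0: "a man walks a dog", 60: "a red car parks", 120: "credits roll"}


def toy_perceive(start: float, end: float) -> str:
    return CAPTIONS.get(int(start), "nothing notable")


def toy_policy(question: str, memory_text: str) -> Action:
    if "red car" in memory_text:
        return ("answer", "around 60s")
    # Inspect the next unseen 60-second window.
    next_start = 60.0 * len(memory_text.splitlines())
    return ("inspect", next_start, next_start + 60.0)


if __name__ == "__main__":
    answer, memory = run_episode("When does the red car appear?",
                                 toy_policy, toy_perceive)
    print(answer)
    print(memory.render())
```

In this toy run the policy inspects two segments and then answers, leaving the third segment unread. In a real system the policy and perceiver would be the agent model itself and its audio-visual tools, and the stopping decision would be learned rather than hard-coded.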