Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm: they process all frames uniformly regardless of query difficulty, so computational cost grows with video duration. Interactive frameworks have emerged, but they often rely on global pre-scanning, so their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning, which bootstraps native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling: performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) show that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
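
The Observation-Thought-Action cycle with a persistent textual memory can be illustrated with a minimal sketch. This is not OmniAgent's implementation: `TextMemory`, `run_agent`, `toy_policy`, `toy_perceive`, the `"watch"` tool name, and the 60-second window are all hypothetical names and values chosen for illustration, and the policy here is a hand-written rule rather than a trained model. The sketch only shows the control flow: the policy sees the question and the accumulated text notes, never the raw video, so its context grows with the number of turns rather than with video length.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Action = Tuple[str, float, float]   # (tool name, start second, end second)
Decision = Tuple[str, object]       # ("act", Action) or ("answer", str)


@dataclass
class TextMemory:
    """Persistent textual memory: short notes are kept, raw frames are not."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def run_agent(
    question: str,
    policy: Callable[[str, List[str]], Decision],
    perceive: Callable[[Action], str],
    max_turns: int = 8,
) -> Tuple[str, TextMemory]:
    """Iterate thought -> action -> observation until the policy answers."""
    memory = TextMemory()
    for _ in range(max_turns):
        # Thought: the policy reads only the question and the text memory.
        kind, payload = policy(question, memory.notes)
        if kind == "answer":
            return str(payload), memory
        # Action + observation: inspect one segment, store a text summary.
        memory.add(perceive(payload))
    return "unknown", memory  # turn budget exhausted without an answer


def toy_perceive(action: Action) -> str:
    """Stand-in for an audio-visual tool: looks up a canned caption."""
    tool, start, end = action
    captions = {0.0: "a street, no cars", 60.0: "a red car parks"}
    caption = captions.get(start, "nothing notable")
    return f"[{tool} {start:.0f}-{end:.0f}s] {caption}"


def toy_policy(question: str, notes: List[str]) -> Decision:
    """Rule-based stand-in for the model: scan forward until evidence appears."""
    for note in notes:
        if "red car" in note:
            return ("answer", note)
    start = 60.0 * len(notes)
    return ("act", ("watch", start, start + 60.0))


answer, memory = run_agent("When does the red car appear?", toy_policy, toy_perceive)
```

In this toy run the agent inspects two 60-second segments and then answers from its notes. Raising `max_turns` gives the policy more chances to gather evidence, which loosely mirrors the turn budget behind the test-time scaling claim, though the toy rule says nothing about how the trained agent actually chooses its actions.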