Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract

Passive models for long-video understanding typically follow a "watch-it-all" paradigm: they process frames uniformly regardless of query difficulty, so computational cost grows with video duration. Interactive frameworks have emerged, but they often depend on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as an iterative Observation-Thought-Action cycle grounded in a partially observable Markov decision process (POMDP). OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning, which bootstraps native active perception through best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which uses turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling: performance improves as the number of reasoning turns increases, which supports the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) show that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the roughly 10 times larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
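To make the Observation-Thought-Action cycle concrete, the following is a minimal toy sketch in Python. It is not OmniAgent's implementation. The `ToyVideo` and `Agent` classes, the "door" event, and the bisection rule standing in for the Thought step are all hypothetical, chosen only to illustrate how on-demand observation and a persistent textual memory can keep context size tied to the number of turns rather than to video length.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyVideo:
    """Hypothetical stand-in for a long video: a door opens at `event_segment`."""
    num_segments: int
    event_segment: int

    def observe(self, segment: int) -> str:
        # On-demand perception: only the requested segment is inspected.
        if segment < self.event_segment:
            return "door closed"
        if segment == self.event_segment:
            return "door opening"
        return "door open"


@dataclass
class Agent:
    # Persistent textual memory: one short note per turn (assumed format).
    memory: List[str] = field(default_factory=list)

    def run(self, video: ToyVideo, max_turns: int = 32) -> Optional[int]:
        lo, hi = 0, video.num_segments - 1
        for _ in range(max_turns):
            if lo > hi:
                break
            segment = (lo + hi) // 2              # Action: choose where to look
            cue = video.observe(segment)          # Observation: inspect one segment
            self.memory.append(f"segment {segment}: {cue}")  # distill cue into text
            # Thought (stubbed as bisection): narrow the search window.
            if cue == "door opening":
                return segment
            if cue == "door closed":
                lo = segment + 1
            else:
                hi = segment - 1
        return None


video = ToyVideo(num_segments=10_000, event_segment=7_321)
agent = Agent()
answer = agent.run(video)
print(answer, len(agent.memory))
```

In this toy setting the agent locates the event after inspecting at most 14 of the 10,000 segments, and its memory holds one line per turn. A real agent would replace the bisection rule with a learned policy and the string cues with audio-visual observations.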