
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

October 23, 2025
作者: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
cs.AI

Abstract

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
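The abstract describes rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision, and reasoning traces that cite timestamps, objects, and bounding boxes, but the exact formulas are not given here. The sketch below is only a hypothetical illustration of how such grounded evidence and a composite reward could be represented, assuming a temporal IoU over predicted time spans and a box IoU over predicted regions; every name, field, and weight is an assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Evidence:
    """One grounded observation: a timestamp (seconds), an object label, and a box (x1, y1, x2, y2)."""
    timestamp: float
    label: str
    box: Tuple[float, float, float, float]

def temporal_iou(pred_span, gt_span):
    """IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def composite_reward(answer_correct: bool,
                     pred_span, gt_span,
                     pred_boxes: List[tuple], gt_boxes: List[tuple],
                     w_ans: float = 1.0, w_temp: float = 0.5, w_spat: float = 0.5) -> float:
    """Hypothetical weighted sum of answer, temporal-alignment, and spatial-precision terms."""
    r_ans = 1.0 if answer_correct else 0.0
    r_temp = temporal_iou(pred_span, gt_span)
    # Average spatial IoU over matched predicted / ground-truth box pairs.
    pairs = list(zip(pred_boxes, gt_boxes))
    r_spat = sum(box_iou(p, g) for p, g in pairs) / len(pairs) if pairs else 0.0
    return w_ans * r_ans + w_temp * r_temp + w_spat * r_spat

# Example: correct answer, well-aligned time span, reasonably tight box.
print(composite_reward(True, (12.0, 15.0), (11.5, 15.5),
                       [(40, 60, 200, 220)], [(50, 70, 210, 230)]))
```

In a scheme like this, the weights would trade off getting the answer right against how tightly the cited evidence matches the annotated spans and boxes; the actual reward design used in the cold-start RL stage is specified in the paper itself.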