Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
October 23, 2025
Authors: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
cs.AI
Abstract
Most video reasoning models only generate textual reasoning traces without
indicating when and where key evidence appears. Recent models such as OpenAI-o3
have sparked wide interest in evidence-centered reasoning for images, yet
extending this ability to videos is more challenging, as it requires joint
temporal tracking and spatial localization across dynamic scenes. We introduce
Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal
evidence into video reasoning, and we carefully curate training data and design
training strategies to address these challenges. The model
highlights key timestamps, objects, and bounding boxes alongside its answers,
allowing reasoning to be grounded in concrete visual observations. To enable
this functionality, we first curate two high-quality datasets,
STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed
temporal and spatial annotations, since most existing datasets offer either
temporal spans for videos or spatial boxes on images, lacking unified
spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start
reinforcement learning strategy with multiple specially designed rewards that
jointly encourage answer accuracy, temporal alignment, and spatial precision.
On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance,
raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent
improvements are also observed on a broad range of video understanding
benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond
accuracy, the reasoning traces produced by Open-o3 Video also provide valuable
signals for test-time scaling, enabling confidence-aware verification and
improving answer reliability.
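
To make the grounded output concrete, the sketch below shows what an answer annotated with key timestamps, objects, and bounding boxes could look like. The field names and values are illustrative assumptions for this summary, not the paper's actual output schema.

```python
# Illustrative example of a spatio-temporally grounded answer: the answer
# text plus the cited evidence (timestamp in seconds, object label, and a
# bounding box in pixel coordinates). Field names are assumptions.
grounded_answer = {
    "answer": "The person picks up the red mug.",
    "evidence": [
        {"timestamp": 12.4, "object": "red mug", "box": [320, 180, 410, 260]},
        {"timestamp": 13.1, "object": "person", "box": [150, 60, 380, 470]},
    ],
}
```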
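The abstract describes multiple rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. A minimal sketch of how such a composite reward could be computed follows, assuming interval IoU for temporal alignment and box IoU for spatial precision; the weights, function names, and exact-match answer check are illustrative assumptions, not the paper's implementation.

```python
def interval_iou(pred, gt):
    """IoU between two time spans (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def total_reward(pred, gt, w_ans=1.0, w_t=0.5, w_s=0.5):
    """Weighted sum of answer, temporal, and spatial reward terms.
    A real system might score the answer with an LLM judge instead of
    exact match; this is a hypothetical sketch."""
    r_ans = 1.0 if pred["answer"] == gt["answer"] else 0.0
    r_t = interval_iou(pred["span"], gt["span"])
    r_s = box_iou(pred["box"], gt["box"])
    return w_ans * r_ans + w_t * r_t + w_s * r_s
```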
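For the test-time scaling claim, one plausible reading is best-of-N selection, where each sampled reasoning trace's grounded evidence yields a confidence score that votes for an answer. The sketch below assumes such per-trace confidences are already available (e.g., from verifying cited timestamps and boxes against the video); all names are hypothetical, not the paper's method.

```python
from collections import defaultdict

def confidence_weighted_vote(traces):
    """traces: list of {'answer': str, 'confidence': float}. Sums the
    confidence mass behind each candidate answer and returns the answer
    with the highest total (hypothetical aggregation rule)."""
    scores = defaultdict(float)
    for t in traces:
        scores[t["answer"]] += t["confidence"]
    return max(scores, key=scores.get)

# Example: three sampled traces voting on two candidate answers.
print(confidence_weighted_vote([
    {"answer": "red mug", "confidence": 0.9},
    {"answer": "blue cup", "confidence": 0.4},
    {"answer": "red mug", "confidence": 0.7},
]))  # -> "red mug"
```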