SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
December 15, 2025
Authors: Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
cs.AI
Abstract
As humans, we are natural any-horizon reasoners: depending on the task, we can decide whether to iteratively skim a long video or watch a short one in full. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, state-of-the-art models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, which requires significant resources. This raises the question: is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Second, we introduce a simple synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe that is essential for instilling any-horizon reasoning ability in SAGE-MM. Third, we curate SAGE-Bench, a benchmark with an average video duration of over 700 seconds, for evaluating video reasoning in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks and an impressive 8.2% improvement on videos longer than 10 minutes.