

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

December 15, 2025
作者: Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
cs.AI

Abstract

As humans, we are natural any-horizon reasoners: for a given task, we can decide whether to iteratively skim a long video or watch a short one in full. With this in mind, one would expect video reasoning models to reason flexibly across durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, which requires significant compute. This raises the question: is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Second, we introduce a lightweight synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE, and we further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Third, we curate SAGE-Bench, a benchmark with an average video duration of over 700 seconds, for evaluating video reasoning in real-world entertainment use cases. Finally, we empirically validate the effectiveness of our system, data, and RL recipe, observing improvements of up to 6.1% on open-ended video reasoning tasks and an 8.2% improvement on videos longer than 10 minutes.
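To make the any-horizon idea concrete, the loop below is a minimal sketch of what such an orchestrator could look like: per turn, a model either answers immediately (single-turn, as for short videos) or requests a new set of frames and keeps reasoning (multi-turn skimming, as for long videos), under a turn budget. The `orchestrate` function, `Turn` record, and the action schema are hypothetical illustrations, not the paper's actual interface.

```python
# Hypothetical any-horizon orchestrator loop (illustrative only; not SAGE's code).
from dataclasses import dataclass


@dataclass
class Turn:
    frames: list  # frame indices inspected this turn
    thought: str  # orchestrator's intermediate reasoning


def orchestrate(question, video_frames, model, max_turns=5):
    """Skim iteratively or answer in a single turn, up to a turn budget.

    `model` is any callable returning either
    {"type": "answer", "text": ...} or
    {"type": "skim", "thought": ..., "frames": [...]}.
    """
    history = []
    # Start from a sparse, uniform skim over the whole video.
    stride = max(1, len(video_frames) // 8)
    window = list(range(0, len(video_frames), stride))
    for _ in range(max_turns):
        action = model(question, [video_frames[i] for i in window], history)
        if action["type"] == "answer":  # confident: stop early (single turn)
            return action["text"], history
        history.append(Turn(frames=window, thought=action["thought"]))
        window = action["frames"]  # zoom into the requested segment
    # Budget exhausted: force a final answer from what was seen so far.
    final = model(question, [], history, force_answer=True)
    return final["text"], history
```

A short video would typically end on the first turn (one model call), while a long one accrues several `Turn` records before answering, which is the compute saving the abstract alludes to.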