ChatPaper.ai


LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

February 24, 2026
Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
cs.AI

Abstract

This paper addresses the critical and underexplored challenge of long video understanding under low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation that avoids the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting exploration as soon as it has acquired sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where the RL stage employs a specifically designed reward function to encourage selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1