LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
February 24, 2026
Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
cs.AI
Abstract
This paper addresses the critical and underexplored challenge of long video understanding under low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation that avoids the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting exploration as soon as it has acquired sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available at: https://github.com/qiujihao19/LongVideo-R1
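The traversal described above (start from top-level summaries, iteratively pick the most informative sub-clip, stop early once the query is answerable) can be sketched as a simple loop. This is an illustrative toy, not the authors' implementation: all names (`Clip`, `answerable`, `pick_child`, `navigate`) and the keyword-matching stand-ins for the MLLM reasoning module are hypothetical.

```python
# Hypothetical sketch of hierarchical clip navigation with early stopping,
# mirroring the inference procedure the abstract describes. The real system
# uses an MLLM reasoning module; here keyword matching stands in for it.
from dataclasses import dataclass, field


@dataclass
class Clip:
    summary: str                                   # high-level caption of this segment
    children: list = field(default_factory=list)   # finer-grained sub-clips


def answerable(summary: str, query: str) -> bool:
    # Stand-in for the agent deciding whether the current clip already
    # contains enough evidence to answer the query.
    return query.lower() in summary.lower()


def pick_child(clips: list, query: str) -> Clip:
    # Stand-in for the reasoning step that infers the most informative
    # sub-clip; here, a naive keyword-overlap score.
    def score(c: Clip) -> int:
        return sum(w in c.summary.lower() for w in query.lower().split())
    return max(clips, key=score)


def navigate(root: Clip, query: str, max_steps: int = 10):
    """Traverse from the top-level summary, refining focus until the query
    is answerable or no finer clips remain (early stopping)."""
    node, steps = root, 0
    while steps < max_steps:
        if answerable(node.summary, query) or not node.children:
            break
        node = pick_child(node.children, query)
        steps += 1
    return node.summary, steps


# Toy two-level hierarchy standing in for a captioned long video.
video = Clip("overview: cooking show", children=[
    Clip("segment: chopping vegetables"),
    Clip("segment: plating the dessert"),
])
print(navigate(video, "dessert"))  # stops after one refinement step
```

In the paper this selection policy is learned (SFT on GPT-5 trajectories, then RL with a reward favoring selective navigation), whereas the sketch hard-codes it; the control flow is what carries over.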