Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
August 28, 2025
Authors: Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni
cs.AI
Abstract
Long-form video understanding, characterized by long-range temporal
dependencies and multiple events, remains a challenge. Existing methods often
rely on static reasoning or external vision-language models (VLMs), which face
issues like complexity and sub-optimal performance due to the lack of
end-to-end training. In this paper, we propose Video-MTR, a reinforced
multi-turn reasoning framework designed to enable iterative key video segment
selection and question comprehension. Unlike traditional video reasoning
pipelines, which generate predictions in a single turn, Video-MTR performs
reasoning in multiple turns, selecting video segments progressively based on
the evolving understanding of previously processed segments and the current
question. This iterative process allows for a more refined and contextually
aware analysis of the video. To ensure the effectiveness of the intermediate reasoning process, we
introduce a novel gated bi-level reward system, combining trajectory-level
rewards based on answer correctness and turn-level rewards emphasizing
frame-query relevance. This system optimizes both video segment selection and
question comprehension, eliminating the need for external VLMs and allowing
end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU,
and EgoSchema demonstrate that Video-MTR outperforms existing methods in both
accuracy and efficiency, advancing the state-of-the-art in long video
understanding.
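The gated bi-level reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Turn`/`Trajectory` containers, the `w_turn` weight, and in particular the gating rule (turn-level relevance rewards count only when the final answer is correct) are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Turn:
    segment: Tuple[int, int]   # (start, end) indices of the selected video segment
    relevance: float           # frame-query relevance score for this turn, in [0, 1]

@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)
    answer_correct: bool = False  # correctness of the final answer

def gated_bilevel_reward(traj: Trajectory, w_turn: float = 0.5) -> float:
    """Combine a trajectory-level correctness reward with the mean of the
    turn-level frame-query relevance rewards. The gating rule used here
    (turn-level rewards are zeroed when the final answer is wrong) is an
    illustrative assumption, not the paper's exact formulation."""
    traj_reward = 1.0 if traj.answer_correct else 0.0
    turn_reward = sum(t.relevance for t in traj.turns) / max(len(traj.turns), 1)
    if not traj.answer_correct:   # gate: suppress turn-level reward on wrong answers
        turn_reward = 0.0
    return traj_reward + w_turn * turn_reward
```

Under this reading, the gate prevents the policy from farming relevance rewards on trajectories that never produce a correct answer, so segment selection is optimized only insofar as it serves the final prediction.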