Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
August 28, 2025
Authors: Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni
cs.AI
Abstract
Long-form video understanding, characterized by long-range temporal
dependencies and multiple events, remains a challenge. Existing methods often
rely on static reasoning or external vision-language models (VLMs), which face
issues like complexity and sub-optimal performance due to the lack of
end-to-end training. In this paper, we propose Video-MTR, a reinforced
multi-turn reasoning framework designed to enable iterative key video segment
selection and question comprehension. Unlike traditional video reasoning
pipelines, which generate predictions in a single turn, Video-MTR performs
reasoning in multiple turns, selecting video segments progressively based on
the evolving understanding of previously processed segments and the current
question. This iterative process allows for a more refined and contextually
aware analysis of the video. To ensure the effectiveness of the intermediate reasoning process, we
introduce a novel gated bi-level reward system, combining trajectory-level
rewards based on answer correctness and turn-level rewards emphasizing
frame-query relevance. This system optimizes both video segment selection and
question comprehension, eliminating the need for external VLMs and allowing
end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU,
and EgoSchema demonstrate that Video-MTR outperforms existing methods in both
accuracy and efficiency, advancing the state-of-the-art in long video
understanding.
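The gated bi-level reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Turn`/`Trajectory` containers, the `w_turn` weight, and in particular the gating rule (turn-level relevance rewards count only when the final answer is correct) are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Turn:
    segment: Tuple[int, int]   # (start, end) indices of the selected video segment
    relevance: float           # frame-query relevance score for this turn, in [0, 1]

@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)
    answer_correct: bool = False  # correctness of the final answer

def gated_bilevel_reward(traj: Trajectory, w_turn: float = 0.5) -> float:
    """Combine a trajectory-level correctness reward with the mean of the
    turn-level frame-query relevance rewards. The gating rule used here
    (turn-level rewards are zeroed when the final answer is wrong) is an
    illustrative assumption, not the paper's exact formulation."""
    traj_reward = 1.0 if traj.answer_correct else 0.0
    turn_reward = sum(t.relevance for t in traj.turns) / max(len(traj.turns), 1)
    if not traj.answer_correct:   # gate: suppress turn-level reward on wrong answers
        turn_reward = 0.0
    return traj_reward + w_turn * turn_reward
```

Under this reading, the gate prevents the policy from farming relevance rewards on trajectories that never produce a correct answer, so segment selection is optimized only insofar as it serves the final prediction.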