Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Abstract

Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or on external visual-language models (VLMs), and they suffer from added complexity and sub-optimal performance because they are not trained end to end. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR reasons over multiple turns, progressively selecting video segments based on its evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system that combines trajectory-level rewards based on answer correctness with turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks such as VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state of the art in long video understanding.
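
The gated bi-level reward can be illustrated with a minimal sketch. The abstract does not specify the gating rule or any weights, so everything below is an assumption made for illustration: the `Turn` record, the `frame_relevance` score in [0, 1], the `turn_weight` parameter, and the choice to let turn-level rewards count only when the final answer is correct. It is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    # Hypothetical per-turn record: how relevant the frames selected in this
    # turn are to the question, as a score in [0, 1].
    frame_relevance: float


def gated_bilevel_reward(
    turns: List[Turn],
    answer_correct: bool,
    turn_weight: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Combine a trajectory-level reward with gated turn-level rewards."""
    # Trajectory-level reward: based only on final answer correctness.
    trajectory_reward = 1.0 if answer_correct else 0.0

    # Assumed gate: turn-level rewards are granted only when the final
    # answer is correct, so relevant-looking retrieval alone earns nothing.
    if not answer_correct or not turns:
        return trajectory_reward

    # Turn-level reward: mean frame-query relevance across turns.
    mean_relevance = sum(t.frame_relevance for t in turns) / len(turns)
    return trajectory_reward + turn_weight * mean_relevance
```

Under these assumptions, a trajectory with two turns of relevance 0.5 and 1.0 and a correct answer scores 1.0 + 0.5 × 0.75 = 1.375, while the same turns with a wrong answer score 0.0.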
