MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

Abstract

Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. Reinforcement learning (RL) has recently been explored to address this issue, but existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. The project is available at https://github.com/THUNLP-MT/MUSEG.
