MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
May 27, 2025
Authors: Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
cs.AI
Abstract
Video temporal understanding is crucial for multimodal large language models
(MLLMs) to reason over events in videos. Despite recent advances in general
video understanding, current MLLMs still struggle with fine-grained temporal
reasoning. While recent work has explored reinforcement learning (RL) to
address this issue, existing RL approaches remain limited in effectiveness. In this
work, we propose MUSEG, a novel RL-based method that enhances temporal
understanding by introducing timestamp-aware multi-segment grounding. MUSEG
enables MLLMs to align queries with multiple relevant video segments, promoting
more comprehensive temporal reasoning. To facilitate effective learning, we
design a customized RL training recipe with phased rewards that progressively
guides the model toward temporally grounded reasoning. Extensive experiments on
temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG
significantly outperforms existing methods and generalizes well across diverse
temporal understanding scenarios. View our project at
https://github.com/THUNLP-MT/MUSEG.
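
For intuition, below is a minimal, hypothetical sketch of the two ideas the abstract names: a reward that scores multiple predicted (start, end) segments against multiple reference segments via temporal IoU, combined with a "phased" weighting schedule. The function names, the best-match scoring scheme, and the phase weights are illustrative assumptions, not the paper's exact reward formulation.

```python
# Hypothetical sketch of a timestamp-aware multi-segment grounding reward
# with a phased schedule. Segments are (start, end) timestamp pairs in seconds.

def segment_iou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def multi_segment_reward(pred_segments, gt_segments):
    """Score each reference segment against its best-matching prediction,
    so the model is rewarded only when it covers *all* relevant segments."""
    if not gt_segments:
        return 0.0
    best = (max((segment_iou(p, gt) for p in pred_segments), default=0.0)
            for gt in gt_segments)
    return sum(best) / len(gt_segments)

def phased_reward(pred_segments, gt_segments, format_ok, phase):
    """Illustrative phased schedule: early training emphasizes well-formed
    timestamped output; later phases shift weight to grounding accuracy.
    The 0.3/0.8 weights are assumed for illustration only."""
    grounding = multi_segment_reward(pred_segments, gt_segments)
    fmt = 1.0 if format_ok else 0.0
    w = 0.3 if phase == "early" else 0.8
    return (1.0 - w) * fmt + w * grounding

# Example: two reference segments; the model localizes both approximately,
# so the averaged best-match IoU (and hence the reward) is high.
pred = [(4.5, 10.0), (31.0, 40.5)]
gt = [(5.0, 10.0), (30.0, 40.0)]
print(phased_reward(pred, gt, format_ok=True, phase="late"))  # ~0.91
```

Averaging the best-match IoU over the reference segments (rather than taking a single best prediction) captures the multi-segment aspect: a model that grounds only one of several relevant segments receives a proportionally lower reward.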