MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
September 25, 2025
Authors: Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
cs.AI
Abstract
Video reasoning has emerged as a critical capability for multimodal large
language models (MLLMs), requiring models to move beyond static perception
toward coherent understanding of temporal dynamics in complex scenes. Yet
existing MLLMs often exhibit process inconsistency, where intermediate
reasoning drifts from video dynamics even when the final answer is correct,
undermining interpretability and robustness. To address this issue, we
introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time
Warping (DTW)-based process reward. This rule-based reward aligns reasoning
traces with temporally grounded references, enabling efficient process
supervision without auxiliary reward models. We further identify dynamic state
prediction as a key measure of video reasoning and construct MOSS-Video, a
benchmark with annotated reasoning traces, where the training split is used to
fine-tune MOSS-ChatV and the held-out split is reserved for evaluation.
MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on
general video benchmarks such as MVBench and MMVU. The framework consistently
yields gains across different architectures, including Qwen2.5-VL and Phi-2,
confirming its broad applicability. Evaluations with GPT-4o-as-judge further
show that MOSS-ChatV produces more consistent and stable reasoning traces.
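
To make the DTW-based process reward concrete, the following is a minimal sketch of how such a rule-based reward could be computed. The abstract does not specify the state encoding, distance function, or reward scaling, so `dist`, the length normalization, the exponential mapping, and the toy state vectors below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a DTW-based process reward (assumed design, not the
# paper's implementation). A reasoning trace and a temporally grounded
# reference are each represented as sequences of state vectors; the
# reward is a deterministic function of their alignment cost, so no
# auxiliary (learned) reward model is needed.
import math
from typing import List, Sequence


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two state vectors (illustrative choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(trace: List[Sequence[float]], ref: List[Sequence[float]]) -> float:
    """Classic O(n*m) dynamic-time-warping alignment cost."""
    n, m = len(trace), len(ref)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(trace[i - 1], ref[j - 1])
            D[i][j] = c + min(
                D[i - 1][j],      # skip a trace step
                D[i][j - 1],      # skip a reference step
                D[i - 1][j - 1],  # match the two steps
            )
    return D[n][m]


def process_reward(trace: List[Sequence[float]], ref: List[Sequence[float]]) -> float:
    """Map alignment cost to a reward in (0, 1]: lower cost -> higher reward."""
    cost = dtw_cost(trace, ref) / max(len(trace), len(ref))  # length-normalized
    return math.exp(-cost)  # bounded, rule-based reward signal


# Toy example: a trace that drifts from the reference dynamics scores
# lower than one that tracks them, even if both reach the final state.
ref = [[0.0], [1.0], [2.0], [3.0]]
print(process_reward([[0.0], [1.1], [2.0], [3.0]], ref))  # close to 1
print(process_reward([[3.0], [0.0], [3.0], [3.0]], ref))  # much lower
```

Because the reward depends on the whole alignment path rather than only the final answer, it penalizes the process inconsistency the abstract describes, where intermediate reasoning drifts from the video dynamics even when the final answer is correct.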