MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
September 25, 2025
作者: Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
cs.AI
Abstract
Video reasoning has emerged as a critical capability for multimodal large
language models (MLLMs), requiring models to move beyond static perception
toward coherent understanding of temporal dynamics in complex scenes. Yet
existing MLLMs often exhibit process inconsistency, where intermediate
reasoning drifts from video dynamics even when the final answer is correct,
undermining interpretability and robustness. To address this issue, we
introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time
Warping (DTW)-based process reward. This rule-based reward aligns reasoning
traces with temporally grounded references, enabling efficient process
supervision without auxiliary reward models. We further identify dynamic state
prediction as a key measure of video reasoning and construct MOSS-Video, a
benchmark with annotated reasoning traces, where the training split is used to
fine-tune MOSS-ChatV and the held-out split is reserved for evaluation.
MOSS-ChatV achieves 87.2% accuracy on MOSS-Video (test) and improves performance on
general video benchmarks such as MVBench and MMVU. The framework consistently
yields gains across different architectures, including Qwen2.5-VL and Phi-2,
confirming its broad applicability. Evaluations with GPT-4o-as-judge further
show that MOSS-ChatV produces more consistent and stable reasoning traces.
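To make the core idea concrete: Dynamic Time Warping finds the lowest-cost monotonic alignment between two sequences of possibly different lengths, so a predicted reasoning trace can be scored against a temporally grounded reference without requiring step-by-step correspondence. The sketch below is illustrative only, not the paper's implementation; the `step_cost` function and the reward shaping `1 / (1 + dtw)` are assumptions for the example.

```python
def dtw_distance(pred, ref, step_cost=lambda x, y: abs(x - y)):
    """Classic O(n*m) DTW between two sequences via dynamic programming.

    D[i][j] = cost of the best alignment of pred[:i] with ref[:j],
    where each step may advance in pred, in ref, or in both.
    """
    n, m = len(pred), len(ref)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = step_cost(pred[i - 1], ref[j - 1])
            # Extend the cheapest of: insertion, deletion, or match/substitution.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def process_reward(pred, ref):
    """Hypothetical reward shaping: map DTW distance to (0, 1],
    so a perfectly aligned trace earns reward 1."""
    return 1.0 / (1.0 + dtw_distance(pred, ref))
```

In practice the sequence elements would be embeddings or discretized state labels of reasoning steps rather than scalars, with `step_cost` replaced by a suitable distance in that space; the rule-based nature of the reward comes from the fact that no learned reward model is needed, only the DTW recurrence above.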