MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Abstract

Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
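
To make the idea of a DTW-based process reward concrete, the following is a minimal sketch, not the paper's implementation. The abstract specifies neither the step-level distance nor how the DTW cost is turned into a reward, so both `jaccard_cost` and `process_reward` are hypothetical choices made only for illustration. Only `dtw_distance` is the standard dynamic time warping recurrence.

```python
from typing import Callable, Sequence


def dtw_distance(
    trace: Sequence[str],
    reference: Sequence[str],
    step_cost: Callable[[str, str], float],
) -> float:
    """Standard dynamic time warping cost between two step sequences."""
    n, m = len(trace), len(reference)
    inf = float("inf")
    # cost[i][j]: minimal cumulative cost of aligning trace[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = step_cost(trace[i - 1], reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # trace advances, reference step repeats
                cost[i][j - 1],      # reference advances, trace step repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def jaccard_cost(a: str, b: str) -> float:
    """Hypothetical step cost: 1 minus the Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def process_reward(trace: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical mapping from DTW cost to a reward in [0, 1].

    Each step cost is at most 1 and a warping path of length max(n, m)
    always exists, so the DTW cost never exceeds max(n, m).
    """
    if not trace or not reference:
        return 0.0
    dist = dtw_distance(trace, reference, jaccard_cost)
    return 1.0 - dist / max(len(trace), len(reference))


if __name__ == "__main__":
    reference = ["person picks up cup", "person drinks water", "person puts cup down"]
    stretched = [reference[0], reference[0], reference[1], reference[2]]
    print(process_reward(reference, reference))        # identical order
    print(process_reward(stretched, reference))        # one step repeated
    print(process_reward(reference[::-1], reference))  # reversed order
```

Under these assumptions, a trace that repeats or lingers on a step still aligns perfectly with the reference, while a trace whose steps occur in the wrong temporal order receives a lower score. Because the score is computed by a fixed rule, no separately trained reward model is needed.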
