Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Abstract

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma. Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, which makes them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, because they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks, since reasoning and scoring are coupled within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, and a dedicated discriminative scoring module, consisting of a learnable query token and a regression head, then predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start that incorporates a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
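
The decoupled scoring step can be illustrated with a small sketch. The abstract does not say how the query token reads the MLLM states, what the random mask hides, or which loss trains the regression head, so everything below is an assumption made for illustration: single-head dot-product attention pooling, a mask that randomly drops CoT token positions while keeping all video tokens, a linear head, and a Bradley-Terry pairwise loss. The names `random_cot_mask`, `score`, and `pairwise_loss` are hypothetical and do not come from DeScore.

```python
import math
import random


def random_cot_mask(num_video, num_cot, drop_prob, rng):
    """Keep every video token; drop each CoT token with probability drop_prob.

    Assumed masking scheme: the source only says a "random mask mechanism" is used.
    """
    return [True] * num_video + [rng.random() >= drop_prob for _ in range(num_cot)]


def score(hidden, query, head_w, head_b, keep):
    """Pool hidden states with one query vector, then apply a linear regression head.

    hidden: list of token vectors (video tokens followed by CoT tokens)
    query:  the query vector (learnable in a real model, fixed here)
    keep:   boolean mask; masked-out tokens receive zero attention weight
    """
    d = len(query)
    logits = [
        sum(q * h for q, h in zip(query, row)) / math.sqrt(d) if k else float("-inf")
        for row, k in zip(hidden, keep)
    ]
    m = max(logits)  # finite as long as at least one token is kept
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    pooled = [sum(w * row[j] for w, row in zip(weights, hidden)) / total for j in range(d)]
    return sum(w * p for w, p in zip(head_w, pooled)) + head_b


def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_preferred - r_rejected))."""
    return math.log1p(math.exp(-(r_preferred - r_rejected)))


# Toy usage: 2 video tokens followed by 4 CoT tokens, hidden size 4.
rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
keep = random_cot_mask(num_video=2, num_cot=4, drop_prob=0.5, rng=rng)
reward = score(hidden, [0.1] * 4, [0.5] * 4, 0.0, keep)
```

In this sketch the reward is read from a pooled representation rather than decoded as text, which is one way to separate scoring from the autoregressive chain. Masking CoT tokens during training is one plausible way to keep the head from depending on any single reasoning token.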
