Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
May 7, 2026
Authors: Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wang, Kuien Liu, Xiang Wang
cs.AI
Abstract
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma. Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences; yet they suffer from inherent optimization bottlenecks because reasoning and scoring are coupled within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, after which a dedicated discriminative scoring module, consisting of a learnable query token and a regression head, predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates into superior model performance.