Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
May 7, 2026
Authors: Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wang, Kuien Liu, Xiang Wang
cs.AI
Abstract
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences; yet they suffer from inherent optimization bottlenecks because reasoning and scoring are coupled within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module, consisting of a learnable query token and a regression head, that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random-mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning translates directly into superior model performance.
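To make the described architecture concrete, below is a minimal PyTorch sketch of what such a decoupled scoring module could look like. The abstract specifies only that the module consists of a learnable query token and a regression head operating on the MLLM's CoT-conditioned features; the cross-attention pooling, the module name DecoupledScoringHead, the layer sizes, and the masking hook are illustrative assumptions of ours, not details from the paper.

```python
import torch
import torch.nn as nn


class DecoupledScoringHead(nn.Module):
    """Minimal sketch (assumptions ours): a single learnable query token
    cross-attends over the MLLM's hidden states (video, prompt, and
    generated CoT tokens), and a regression head maps the attended
    representation to a scalar reward."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query token that summarizes the CoT-conditioned context.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.reg_head = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor,
                key_padding_mask: torch.Tensor | None = None) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the MLLM backbone.
        # key_padding_mask: optional (batch, seq_len) bool mask; True = ignore.
        # A cold-start random mask over CoT positions could be injected here
        # via key_padding_mask (our guess at the paper's mechanism).
        q = self.query.expand(hidden_states.size(0), -1, -1)
        pooled, _ = self.attn(q, hidden_states, hidden_states,
                              key_padding_mask=key_padding_mask)
        return self.reg_head(pooled.squeeze(1)).squeeze(-1)  # (batch,) rewards
```

In a full pipeline, hidden_states would presumably be the MLLM's final-layer states over the video, prompt, and generated CoT tokens, and the cold-start random-mask mechanism could be emulated by setting key_padding_mask entries over randomly chosen CoT positions so the scorer does not overfit to any single reasoning span.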