Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
June 8, 2026
Authors: Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi
cs.AI
Abstract
Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
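To make the central idea concrete, the following is a minimal sketch of why a distribution over rubric scores carries more information than a scalar, and of how a student could be fit to a teacher's distribution. It is not the paper's implementation: the 5-point rubric, the function names, and the use of a plain KL divergence as the distillation objective are all assumptions chosen for illustration, and the abstract does not specify the actual GDSO or RISD losses.

```python
import math

# Hypothetical 5-point rubric; the abstract does not state the rubric size.
RUBRIC_SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw score-token logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=RUBRIC_SCORES):
    """Scalar reward obtained as the expectation of the score distribution."""
    return sum(p * s for p, s in zip(probs, scores))


def score_variance(probs, scores=RUBRIC_SCORES):
    """Spread of the distribution, which a scalar reward discards."""
    mean = expected_score(probs, scores)
    return sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student), used here as an assumed distillation loss."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


# Two images with the same expected score but very different rater agreement.
confident = [0.0, 0.0, 1.0, 0.0, 0.0]  # every rater gives a 3
divided = [0.5, 0.0, 0.0, 0.0, 0.5]    # raters split between 1 and 5

print(expected_score(confident), expected_score(divided))
print(score_variance(confident), score_variance(divided))
print(kl_divergence(divided, confident))
```

Both example distributions collapse to the same expected score, so a scalar reward model cannot tell them apart, while the variance and the KL term do. In this sketch the expectation plays the role of a deployable scalar reward, and the KL term stands in for whatever objective transfers the teacher's distribution to the student.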