Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
June 8, 2026
Authors: Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi
cs.AI
Abstract
Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
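To make the central idea concrete, here is a minimal sketch of how a distribution over rubric scores can be reduced to an expected score while still exposing the uncertainty a single scalar would hide. It is an illustration under stated assumptions, not the Z-Reward implementation: the 1-to-5 rubric, the logit values, and the helper names (`softmax`, `expected_score`, `score_variance`, `prefers_first`) are all hypothetical, and the abstract does not specify how GDSO or RISD compute their losses.

```python
import math

# Assumed rubric scale; the abstract does not state the actual score range.
RUBRIC_SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Convert raw per-score logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=RUBRIC_SCORES):
    """Expectation of the score distribution (a scalar summary)."""
    return sum(p * s for p, s in zip(probs, scores))


def score_variance(probs, scores=RUBRIC_SCORES):
    """Spread of the distribution; a scalar reward would discard this."""
    mean = expected_score(probs, scores)
    return sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))


def prefers_first(probs_a, probs_b, scores=RUBRIC_SCORES):
    """Hypothetical pairwise rule: prefer the higher expected score."""
    return expected_score(probs_a, scores) > expected_score(probs_b, scores)


# Two made-up images: one judged consistently, one that splits opinion.
confident = softmax([0.0, 0.0, 0.0, 6.0, 0.0])  # mass concentrated on score 4
divisive = softmax([2.0, 0.0, 0.0, 0.0, 2.0])   # mass split between 1 and 5

print(expected_score(confident), score_variance(confident))
print(expected_score(divisive), score_variance(divisive))
print(prefers_first(confident, divisive))
```

In this toy setup the divisive image has an expected score of about 3 with a large variance, while the confident image scores near 4 with a small one. A deterministic scalar would keep only the first number of each pair, which is the kind of compression the abstract argues against.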