Beyond Scalar Rewards: Internalizing Reasoning into Score Distributions

Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
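
To make the distributional view concrete, the following is a minimal sketch, not the GDSO or RISD objectives themselves, which the abstract does not specify. It assumes a 1-to-5 rubric and illustrative logits, and all names (`expected_score`, `kl_divergence`, `prefers_first`, `SCORES`) are hypothetical. It shows how a distribution over rubric scores yields a scalar expectation, how two images can be compared by their expected-score gap, and how a generic KL divergence could measure how closely a student distribution matches a teacher distribution.

```python
import math

# Assumed rubric: integer scores 1..5 (the abstract does not state the scale).
SCORES = [1, 2, 3, 4, 5]

def softmax(logits):
    """Turn raw logits over rubric scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(probs, scores):
    """Scalar summary of a score distribution: its expectation."""
    return sum(p * s for p, s in zip(probs, scores))

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student), a generic stand-in for a distillation loss."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student))

def prefers_first(probs_a, probs_b, scores):
    """Pairwise preference read off from the expected-score gap."""
    return expected_score(probs_a, scores) > expected_score(probs_b, scores)

# Illustrative logits only; these are not values from the paper.
teacher_a = softmax([0.1, 0.5, 1.5, 2.0, 0.8])   # image A, teacher
teacher_b = softmax([1.0, 1.8, 1.2, 0.3, -0.5])  # image B, teacher
student_a = softmax([0.0, 0.4, 1.4, 1.9, 0.9])   # image A, student

if __name__ == "__main__":
    print("E[score | A] =", round(expected_score(teacher_a, SCORES), 3))
    print("E[score | B] =", round(expected_score(teacher_b, SCORES), 3))
    print("A preferred over B:", prefers_first(teacher_a, teacher_b, SCORES))
    print("KL(teacher || student) on A =", kl_divergence(teacher_a, student_a))
```

Keeping the full distribution rather than only its expectation preserves the uncertainty that a single scalar would discard, while the expectation still gives a differentiable scalar when one is needed for optimization.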