Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
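As a rough illustration of why a distribution over rubric scores carries more information than a single scalar, the sketch below reads a scalar reward off a score distribution as its expectation and measures how far a student distribution is from a teacher distribution. It is a minimal sketch under stated assumptions, not the method of the paper: the 5-point rubric, the function names, and the choice of KL divergence as a distillation objective are all assumptions made for illustration, since the abstract does not specify the GDSO or RISD losses.

```python
import math

# Assumed 5-point rubric; the abstract does not state the actual score scale.
RUBRIC_SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Convert raw logits over rubric scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=RUBRIC_SCORES):
    """Scalar reward taken as the expectation of the score distribution."""
    return sum(p * s for p, s in zip(probs, scores))


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student), a common distillation objective (assumed here)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


def prefers_first(probs_a, probs_b):
    """Pairwise preference read off the gap between expected scores."""
    return expected_score(probs_a) > expected_score(probs_b)


# Hypothetical distributions with the same mean but different uncertainty.
confident = [0.0, 0.0, 1.0, 0.0, 0.0]
uncertain = [0.2, 0.2, 0.2, 0.2, 0.2]

print(expected_score(confident), expected_score(uncertain))
print(kl_divergence(confident, uncertain))
```

The two hypothetical distributions have the same expected score, so a scalar reward cannot tell them apart, while the divergence between them is clearly nonzero. This is the kind of uncertainty information that scalar compression discards.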