RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains

Abstract

Pointwise reward modeling provides critical signals for LLM post-training, yet it struggles to assign absolute scores in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method combines a probability-based scoring rule that reduces ties, phase-specific preference-based rewards, and an alternating GRPO scheme; together, these train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains in downstream policy post-training.
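
To illustrate why hard Boolean aggregation can produce ties while a probability-based rule can break them, the following is a minimal sketch and not the paper's actual scoring rule, which the abstract does not specify. It assumes a judge that returns, for each rubric criterion, a probability that the response satisfies it. The function names, the 0.5 threshold, the use of a simple mean, and the example probabilities are all hypothetical.

```python
from typing import Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Hard aggregation (assumed form): fraction of criteria judged satisfied.

    Each per-criterion probability is first collapsed to a yes/no verdict,
    so responses with different confidence levels can receive equal scores.
    """
    return sum(p >= threshold for p in probs) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation (assumed form): mean per-criterion probability.

    Keeping the probabilities preserves graded differences between responses.
    """
    return sum(probs) / len(probs)


def prefers(score_a: float, score_b: float) -> str:
    """Turn two pointwise scores into a pairwise verdict: 'A', 'B', or 'tie'."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Hypothetical judge outputs for two responses under the same 3-item rubric.
response_a = [0.9, 0.8, 0.6]
response_b = [0.7, 0.6, 0.55]

# Every criterion clears the threshold for both responses, so the hard rule ties.
hard_verdict = prefers(boolean_score(response_a), boolean_score(response_b))

# The soft rule keeps the confidence gap and separates the two responses.
soft_verdict = prefers(probability_score(response_a), probability_score(response_b))

print(hard_verdict, soft_verdict)
```

In this toy case both responses pass every criterion, so the Boolean rule cannot rank them, whereas the averaged probabilities favor the first response. A pairwise verdict of this kind is one way pointwise scores could be compared against preference labels, though the reward design actually used in RUBRIC-ARROW may differ.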