OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Abstract

Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR), which uses structured natural-language criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling, which removes noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
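The preference-label consistency filter can be illustrated with a minimal sketch. This is one plausible reading of the idea, not the authors' implementation: a candidate rubric is kept only if a judge that applies it prefers the response labeled as preferred. The names `RubricExample`, `Judge`, `filter_consistent`, and `toy_keyword_judge` are all hypothetical, and the keyword-matching judge is a stand-in for what would in practice be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricExample:
    prompt: str
    chosen: str        # response labeled as preferred
    rejected: str      # response labeled as rejected
    rubric: List[str]  # candidate criteria (hard rules and principles)


# Hypothetical judge interface: given a prompt, a rubric, and two responses,
# return "A" if the first response is better under the rubric, else "B".
Judge = Callable[[str, List[str], str, str], str]


def filter_consistent(examples: List[RubricExample], judge: Judge) -> List[RubricExample]:
    """Keep only examples whose rubric leads the judge to the labeled preference."""
    kept = []
    for ex in examples:
        # The preferred response is passed as "A", so a consistent rubric yields "A".
        if judge(ex.prompt, ex.rubric, ex.chosen, ex.rejected) == "A":
            kept.append(ex)
    return kept


def toy_keyword_judge(prompt: str, rubric: List[str], a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: count rubric keywords found in each response."""
    score_a = sum(criterion.lower() in a.lower() for criterion in rubric)
    score_b = sum(criterion.lower() in b.lower() for criterion in rubric)
    return "A" if score_a > score_b else "B"


examples = [
    RubricExample("How do I sort a list?", "Use Python's sorted().", "Sort it by hand.", ["python"]),
    RubricExample("Answer the question.", "Here is the answer.", "A polite reply.", ["polite"]),
]
kept = filter_consistent(examples, toy_keyword_judge)
```

In this toy run the second rubric is discarded because it favors the rejected response. A real judge would likely also need to control for position bias, for example by scoring both response orderings.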
