OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
October 9, 2025
Authors: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
cs.AI
Abstract
Reward modeling lies at the core of reinforcement learning from human
feedback (RLHF), yet most existing reward models rely on scalar or pairwise
judgments that fail to capture the multifaceted nature of human preferences.
Recent studies have explored rubrics-as-rewards (RaR), which uses structured
natural-language criteria to capture multiple dimensions of response quality.
However, producing rubrics that are both reliable and scalable remains a key
challenge. In this work, we introduce OpenRubrics, a diverse, large-scale
collection of (prompt, rubric) pairs for training rubric-generation and
rubric-based reward models. To elicit discriminative and comprehensive
evaluation signals, we introduce Contrastive Rubric Generation (CRG), which
derives both hard rules (explicit constraints) and principles (implicit
qualities) by contrasting preferred and rejected responses. We further improve
reliability by enforcing preference-label consistency via rejection sampling to
remove noisy rubrics. Across multiple reward-modeling benchmarks, our
rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines
by 6.8%. These gains transfer to policy models on instruction-following and
biomedical benchmarks. Our results show that rubrics provide scalable alignment
signals that narrow the gap between costly human evaluation and automated
reward modeling, enabling a new principle-driven paradigm for LLM alignment.
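The two-step recipe described in the abstract, Contrastive Rubric Generation (CRG) followed by preference-label-consistency filtering via rejection sampling, can be pictured with a minimal sketch. Everything below is illustrative and assumed rather than taken from the paper: `call_llm`, the prompt templates, and `generate_consistent_rubric` are hypothetical names standing in for whatever model, prompts, and pipeline OpenRubrics actually uses.

```python
# Illustrative sketch (not the authors' code): contrastive rubric generation
# followed by rejection sampling for preference-label consistency.
from typing import Optional

CRG_PROMPT = """You are writing an evaluation rubric for the prompt below.
Contrast the preferred and rejected responses, then produce:
1. Hard rules: explicit constraints the response must satisfy.
2. Principles: implicit qualities that distinguish the preferred response.

Prompt: {prompt}
Preferred response: {chosen}
Rejected response: {rejected}
"""

JUDGE_PROMPT = """Using the rubric below, decide which response better answers the prompt.
Answer with "A" or "B" only.

Rubric: {rubric}
Prompt: {prompt}
Response A: {a}
Response B: {b}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-following LLM API."""
    raise NotImplementedError


def generate_consistent_rubric(prompt: str, chosen: str, rejected: str,
                               max_tries: int = 4) -> Optional[str]:
    """Sample candidate rubrics and keep one only if a rubric-conditioned
    judge reproduces the original preference label (chosen > rejected)."""
    for _ in range(max_tries):
        rubric = call_llm(CRG_PROMPT.format(prompt=prompt, chosen=chosen,
                                            rejected=rejected))
        verdict = call_llm(JUDGE_PROMPT.format(rubric=rubric, prompt=prompt,
                                               a=chosen, b=rejected))
        if verdict.strip().upper().startswith("A"):
            return rubric  # label-consistent rubric: keep it
    return None  # no consistent sample found: treat the rubric as noisy and drop it
```

The design point mirrored here is the one stated in the abstract: a candidate rubric is retained only when a rubric-based judge agrees with the original preference label, so noisy rubrics are filtered out before training the rubric-generation and rubric-based reward models.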