OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
October 9, 2025
Authors: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
cs.AI
Abstract
Reward modeling lies at the core of reinforcement learning from human
feedback (RLHF), yet most existing reward models rely on scalar or pairwise
judgments that fail to capture the multifaceted nature of human preferences.
Recent studies have explored rubrics-as-rewards (RaR), which uses structured
natural-language criteria to capture multiple dimensions of response quality.
However, producing rubrics that are both reliable and scalable remains a key
challenge. In this work, we introduce OpenRubrics, a diverse, large-scale
collection of (prompt, rubric) pairs for training rubric-generation and
rubric-based reward models. To elicit discriminative and comprehensive
evaluation signals, we introduce Contrastive Rubric Generation (CRG), which
derives both hard rules (explicit constraints) and principles (implicit
qualities) by contrasting preferred and rejected responses. We further improve
reliability by enforcing preference-label consistency via rejection sampling to
remove noisy rubrics. Across multiple reward-modeling benchmarks, our
rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines
by 6.8%. These gains transfer to policy models on instruction-following and
biomedical benchmarks. Our results show that rubrics provide scalable alignment
signals that narrow the gap between costly human evaluation and automated
reward modeling, enabling a new principle-driven paradigm for LLM alignment.
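The two-step recipe described in the abstract, Contrastive Rubric Generation (CRG) followed by preference-label-consistency filtering via rejection sampling, can be pictured with a minimal sketch. Everything below is illustrative and assumed rather than taken from the paper: `call_llm`, the prompt templates, and `generate_consistent_rubric` are hypothetical names standing in for whatever model, prompts, and pipeline OpenRubrics actually uses.

```python
# Illustrative sketch (not the authors' code): contrastive rubric generation
# followed by rejection sampling for preference-label consistency.
from typing import Optional

CRG_PROMPT = """You are writing an evaluation rubric for the prompt below.
Contrast the preferred and rejected responses, then produce:
1. Hard rules: explicit constraints the response must satisfy.
2. Principles: implicit qualities that distinguish the preferred response.

Prompt: {prompt}
Preferred response: {chosen}
Rejected response: {rejected}
"""

JUDGE_PROMPT = """Using the rubric below, decide which response better answers the prompt.
Answer with "A" or "B" only.

Rubric: {rubric}
Prompt: {prompt}
Response A: {a}
Response B: {b}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-following LLM API."""
    raise NotImplementedError


def generate_consistent_rubric(prompt: str, chosen: str, rejected: str,
                               max_tries: int = 4) -> Optional[str]:
    """Sample candidate rubrics and keep one only if a rubric-conditioned
    judge reproduces the original preference label (chosen > rejected)."""
    for _ in range(max_tries):
        rubric = call_llm(CRG_PROMPT.format(prompt=prompt, chosen=chosen,
                                            rejected=rejected))
        verdict = call_llm(JUDGE_PROMPT.format(rubric=rubric, prompt=prompt,
                                               a=chosen, b=rejected))
        if verdict.strip().upper().startswith("A"):
            return rubric  # label-consistent rubric: keep it
    return None  # no consistent sample found: treat the rubric as noisy and drop it
```

The design point mirrored here is the one stated in the abstract: a candidate rubric is retained only when a rubric-based judge agrees with the original preference label, so noisy rubrics are filtered out before training the rubric-generation and rubric-based reward models.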