

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

October 9, 2025
作者: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
cs.AI

Abstract

Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR), which uses structured natural-language criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we propose Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
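The abstract names two mechanisms: Contrastive Rubric Generation (CRG), which drafts hard rules and principles by contrasting a preferred and a rejected response, and a rejection-sampling filter that keeps only rubrics consistent with the human preference label. The sketch below is a minimal Python illustration of how such a pipeline could be wired together under those assumptions; the `llm`, `parse`, and `judge` callables, the dataclass fields, the prompt wording, and the "A"/"B" verdict format are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a CRG-style rubric pipeline with preference-label
# consistency filtering. All names and formats are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # human-preferred response
    rejected: str    # dispreferred response


@dataclass
class Rubric:
    hard_rules: List[str]   # explicit constraints a response must satisfy
    principles: List[str]   # implicit qualities separating better responses


def contrastive_rubric_prompt(pair: PreferencePair) -> str:
    """Build a generation prompt that contrasts the two responses so the
    rubric captures what actually distinguishes them."""
    return (
        "Given the task and two responses, write evaluation criteria.\n"
        "List HARD RULES (explicit constraints) and PRINCIPLES (implicit "
        "qualities) that explain why Response A is better than Response B.\n\n"
        f"Task: {pair.prompt}\n"
        f"Response A (preferred): {pair.chosen}\n"
        f"Response B (rejected): {pair.rejected}\n"
    )


def generate_rubric(
    pair: PreferencePair,
    llm: Callable[[str], str],        # placeholder LLM call: prompt -> text
    parse: Callable[[str], Rubric],   # placeholder parser: text -> Rubric
) -> Rubric:
    """One CRG call: an LLM drafts a rubric from the contrastive prompt."""
    return parse(llm(contrastive_rubric_prompt(pair)))


def consistent_with_preference(
    rubric: Rubric,
    pair: PreferencePair,
    judge: Callable[[Rubric, str, str, str], str],  # returns "A" or "B"
) -> bool:
    """Preference-label consistency: keep a rubric only if a rubric-conditioned
    judge prefers the human-chosen response."""
    verdict = judge(rubric, pair.prompt, pair.chosen, pair.rejected)
    return verdict == "A"


def collect_rubrics(
    pairs: List[PreferencePair],
    llm: Callable[[str], str],
    parse: Callable[[str], Rubric],
    judge: Callable[[Rubric, str, str, str], str],
    max_attempts: int = 4,
) -> List[Optional[Rubric]]:
    """Rejection sampling: resample rubrics that contradict the preference
    label; return None for a pair if no consistent rubric is found."""
    kept: List[Optional[Rubric]] = []
    for pair in pairs:
        accepted: Optional[Rubric] = None
        for _ in range(max_attempts):
            candidate = generate_rubric(pair, llm, parse)
            if consistent_with_preference(candidate, pair, judge):
                accepted = candidate
                break
        kept.append(accepted)
    return kept
```

In this reading, the consistency check acts as a cheap automatic filter: a rubric that cannot reproduce the known human preference is treated as noisy and discarded rather than used to train the rubric-based reward model.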