
Reinforcement Learning with Rubric Anchors

August 18, 2025
Authors: Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao
cs.AI

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
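
To illustrate the core idea of rubric-based rewards described in the abstract, the sketch below shows one plausible way to turn a rubric into a scalar RL reward: each criterion is scored by a judge (e.g., an LLM grader) and the scores are aggregated by weight. This is a minimal, hypothetical illustration; `RubricCriterion`, `rubric_reward`, and the `judge` callable are assumptions for exposition, not the authors' released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCriterion:
    """One model-interpretable criterion from a rubric (hypothetical structure)."""
    description: str  # e.g., "avoids generic 'AI-like' phrasing and hedging"
    weight: float     # relative importance when aggregating scores


def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[RubricCriterion],
    judge: Callable[[str, str, str], float],  # returns a score in [0, 1] per criterion
) -> float:
    """Aggregate per-criterion judge scores into a single scalar reward for RL."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        c.weight * judge(prompt, response, c.description) for c in rubric
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Toy judge: a placeholder standing in for an LLM grader.
    def toy_judge(prompt: str, response: str, criterion: str) -> float:
        return 1.0 if "because" in response else 0.5

    rubric = [
        RubricCriterion("gives a concrete, well-reasoned answer", weight=2.0),
        RubricCriterion("uses a natural, human-like tone", weight=1.0),
    ]
    print(rubric_reward("Why is the sky blue?", "It is blue because of Rayleigh scattering.", rubric, toy_judge))
```

In practice the judge would be a separate grader model prompted with the criterion text, and the aggregated score would replace the unit-test or answer-matching signal used in standard RLVR.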