

Reinforcement Learning with Rubric Anchors

August 18, 2025
Authors: Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao
cs.AI

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
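The abstract describes rubrics as structured, model-interpretable criteria that an automatic grader scores subjective outputs against. The sketch below is a minimal illustration, not the paper's released system: it assumes a hypothetical `judge` callable (e.g. an LLM grader prompted with the criterion text) that returns a per-criterion score in [0, 1], and aggregates the scores into a single scalar reward usable in an RL loop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCriterion:
    """One rubric item: a model-interpretable criterion and its weight."""
    description: str
    weight: float


def rubric_reward(
    response: str,
    rubric: List[RubricCriterion],
    judge: Callable[[str, str], float],
) -> float:
    """Score a free-form response as a weighted average of per-criterion judge scores.

    `judge(response, criterion_description)` is assumed to return a score in [0, 1],
    e.g. from an LLM grader; the result is a scalar reward in [0, 1].
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight == 0:
        return 0.0
    weighted = sum(c.weight * judge(response, c.description) for c in rubric)
    return weighted / total_weight


if __name__ == "__main__":
    # Toy rubric and a stand-in judge, purely for illustration.
    rubric = [
        RubricCriterion("Answers the question directly and accurately", 2.0),
        RubricCriterion("Avoids formulaic, 'AI-like' phrasing", 1.0),
    ]
    toy_judge = lambda resp, crit: 1.0 if len(resp) > 20 else 0.5
    print(rubric_reward("A concrete, well-argued reply to the prompt.", rubric, toy_judge))
```

In practice the per-criterion scoring, rubric construction, and how the reward anchors style during training are the substance of the paper; the snippet only shows the aggregation step under the stated assumptions.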