Reinforcement Learning with Rubric Anchors

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), as exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching the correct answer in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this limitation, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, in which carefully designed rubrics serve as structured, model-interpretable criteria for automatically scoring subjective outputs. We construct what is, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics written by humans, generated by LLMs, or produced through hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we address these challenges with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially in the humanities) and outperforms a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
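
To make the idea of a rubric-based reward concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how rubric scores are combined, so the weighted average below is an assumption, and the names `Criterion`, `rubric_reward`, and `keyword_grader` are hypothetical. The keyword grader is only a toy stand-in for whatever judge (for example, an LLM) would score each criterion in practice.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure): a checkable description and a weight."""
    description: str
    weight: float


# A grader maps (response, criterion) to a score, nominally in [0, 1].
# In a real system this might be an LLM judge; that is an assumption here.
Grader = Callable[[str, Criterion], float]


def rubric_reward(response: str, rubric: Sequence[Criterion], grader: Grader) -> float:
    """Return the weighted average of per-criterion scores, each clipped to [0, 1].

    The weighted-average aggregation is an illustrative assumption,
    not something stated in the abstract.
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    weighted = sum(
        c.weight * min(1.0, max(0.0, grader(response, c))) for c in rubric
    )
    return weighted / total_weight


def keyword_grader(response: str, criterion: Criterion) -> float:
    """Toy judge: 1.0 if the criterion text occurs in the response, else 0.0."""
    return 1.0 if criterion.description.lower() in response.lower() else 0.0


# Usage: two weighted criteria, one of which the response satisfies.
rubric = [Criterion("example", 2.0), Criterion("source", 1.0)]
reward = rubric_reward("Here is an example from history.", rubric, keyword_grader)
```

In an RL loop, a scalar such as `reward` would take the place of the pass/fail signal that RLVR obtains from unit tests or answer matching; how the paper actually weights or gates individual rubric items is not specified in this abstract.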
