Reinforcement Learning with Rubric Anchors

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching the correct answer in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this limitation, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatically scoring subjective outputs. We construct what is, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics written by humans, by LLMs, or through hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially in the humanities), outperforming the 671B DeepSeek-V3 model by +2.4% while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
