Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Abstract

Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen-2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3.
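
The two roles the abstract assigns to rubrics can be illustrated with a minimal sketch. The abstract does not give the decay schedule, the prompt template, or the score aggregation, so everything specific below is an assumption made for illustration: the linear decay in `scaffold_ratio`, the random subset of checklist items shown per rollout, the `points` weights, and the normalized weighted sum in `rubric_reward`. The per-criterion judgments are taken as given booleans; in practice they would presumably come from an LLM-as-a-Judge call, which is not modeled here.

```python
import random
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str  # one checklist-style statement
    points: float   # hypothetical weight; negative for undesirable behavior


def scaffold_ratio(step: int, total_steps: int) -> float:
    """Fraction of rubric items to reveal at a training step.

    Assumed schedule: linear decay from 1.0 to 0.0 (not taken from the source).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def build_prompt(instruction: str, rubric: list[RubricItem],
                 ratio: float, rng: random.Random) -> str:
    """Attach a random subset of rubric items to the task instruction.

    Sampling a fresh subset per rollout is one assumed way to give
    different rollouts different guidance.
    """
    k = round(ratio * len(rubric))
    if k == 0:
        return instruction  # scaffolding fully decayed
    shown = rng.sample(rubric, k)
    hints = "\n".join(f"- {item.criterion}" for item in shown)
    return f"{instruction}\n\nChecklist to satisfy:\n{hints}"


def rubric_reward(rubric: list[RubricItem], judgments: list[bool]) -> float:
    """Aggregate per-criterion judgments into a scalar reward in [0, 1].

    Assumed aggregation: earned points divided by total positive points.
    """
    total = sum(item.points for item in rubric if item.points > 0)
    if total == 0:
        return 0.0
    earned = sum(item.points for item, met in zip(rubric, judgments) if met)
    return min(1.0, max(0.0, earned / total))
```

With this sketch, early rollouts see most of the checklist in the prompt and late rollouts see none of it, while the reward is always computed against the full rubric, so the model is still scored on criteria it is no longer shown.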
