Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Abstract

Reinforcement learning (RL) for large language model (LLM) reasoning has rapidly emerged as a prominent research area, marked by a surge of studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for applying RL techniques and a fragmented understanding of their underlying mechanisms. In addition, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and leaving practitioners unsure which ones to choose. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments spanning datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups and provide a reliable roadmap for practitioners navigating RL for LLMs. Finally, we show that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using the vanilla PPO loss. The results demonstrate that this simple combination consistently improves performance, surpassing strategies such as GRPO and DAPO.
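For readers unfamiliar with the terms in the final claim, the following is a minimal sketch of what a vanilla PPO clipped loss looks like when no critic is used. It is illustrative only: the function names, the `clip_eps` default, and the use of a per-prompt mean reward as the baseline are assumptions made here, not details taken from the paper, and the sketch does not reproduce the two-technique combination, which the abstract does not name.

```python
import math


def group_baseline_advantages(rewards):
    """Critic-free advantages (assumed form): reward minus the mean reward
    of all responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Vanilla PPO clipped surrogate loss, averaged over samples.

    logp_new / logp_old are log-probabilities of the same sampled actions
    under the current and the behavior policy, respectively.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) objective, negated to obtain a loss.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With binary rewards such as `[1.0, 0.0, 0.0, 1.0]` for four responses to one prompt, the baseline is 0.5, so correct responses receive a positive advantage and incorrect ones a negative advantage without any learned value network.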
