
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

August 11, 2025
Authors: Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
cs.AI

Abstract

Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments spanning datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating RL for LLM reasoning. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
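For readers unfamiliar with what a critic-free PPO-style objective looks like in practice, the sketch below illustrates one common formulation: the vanilla PPO clipped surrogate combined with a group-normalized outcome reward used as the advantage, so no value network is needed. The function name, tensor shapes, and the specific normalization and aggregation choices here are illustrative assumptions, not the paper's reported two-technique combination.

```python
# Illustrative sketch (not the paper's exact recipe): a critic-free PPO-clip loss
# where the advantage is a group-normalized outcome reward instead of a critic estimate.
# Assumes per-token log-probs under the current and rollout ("old") policy, a response
# mask, and one scalar reward per sampled response in the group.
import torch

def critic_free_ppo_loss(logp_new, logp_old, rewards, response_mask, clip_eps=0.2):
    """
    logp_new, logp_old: [G, T] token log-probs under current / rollout policy
    rewards:            [G]    scalar outcome reward per response in the group
    response_mask:      [G, T] 1 for response tokens, 0 for padding
    """
    # Group-relative advantage in place of a learned critic.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    adv = adv.unsqueeze(-1).expand_as(logp_new)                 # broadcast to tokens

    # Standard PPO clipped surrogate on the token-level importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Token-level aggregation over the whole group (one of several possible choices).
    return (per_token * response_mask).sum() / response_mask.sum()
```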