
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

August 11, 2025
Authors: Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
cs.AI

Abstract

Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating RL for LLM reasoning. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
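The abstract does not name the two techniques in the minimalist combination, so the sketch below is only a rough illustration of the ingredients it references: the vanilla PPO clipped surrogate loss paired with a critic-free, group-normalized advantage estimate (the baseline style popularized by GRPO-like methods). The function names, signatures, and the choice of group normalization are assumptions for illustration, not the paper's implementation.

```python
import torch

def clipped_ppo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Vanilla PPO clipped surrogate objective over per-token log-probabilities.
    # ratio = pi_theta(a|s) / pi_old(a|s), computed in log-space for stability.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate; negate because we minimize.
    return -torch.minimum(unclipped, clipped).mean()

def group_normalized_advantages(rewards, eps=1e-8):
    # Critic-free advantage estimate: normalize rewards within a group of
    # rollouts sampled from the same prompt, replacing a learned value model.
    # rewards: tensor of shape (num_prompts, rollouts_per_prompt)
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

Under these assumptions, the advantage for each rollout would be broadcast to its tokens and fed into `clipped_ppo_loss` together with the current and behavior-policy log-probabilities; the paper's contribution is identifying which additional techniques make this critic-free setup train reliably.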