Demystifying Reinforcement Learning in Agentic Reasoning

October 13, 2025
Authors: Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang
cs.AI

Abstract

Recently, the emergence of agentic RL has shown that reinforcement learning can also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and best practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy are crucial for agentic RL and improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms both frequent tool calling and verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks: AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models can achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
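
To make insight (ii) concrete, the sketch below illustrates two of the exploration-friendly techniques named in the abstract: an asymmetric ("clip higher") clipping range in a PPO/GRPO-style surrogate, which lets the upper clip bound exceed the lower one so low-probability tokens can grow and policy entropy does not collapse, and a soft overlong-reward penalty that shapes rewards for responses approaching a length cap instead of cutting them off abruptly. This is a minimal illustration, not the authors' released code; the function names and hyperparameter values (eps_low, eps_high, max_len, buffer) are assumptions for exposition.

```python
# Illustrative sketch only: asymmetric clipping ("clip higher") and soft
# overlong-reward shaping, two exploration-friendly techniques mentioned
# in the abstract. Names and values are assumptions, not the paper's code.

import torch


def clipped_surrogate_loss(logp_new, logp_old, advantages,
                           eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style surrogate with an asymmetric clipping range.

    A larger upper bound (eps_high > eps_low) lets low-probability tokens
    increase faster, which helps sustain policy entropy and exploration.
    """
    ratio = torch.exp(logp_new - logp_old)                 # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    loss = -torch.minimum(ratio * advantages, clipped * advantages)
    return loss.mean()


def overlong_reward_penalty(length, max_len=8192, buffer=1024):
    """Soft length shaping: no penalty below (max_len - buffer), a linear
    penalty inside the buffer zone, and -1 once the cap is exceeded."""
    if length <= max_len - buffer:
        return 0.0
    if length >= max_len:
        return -1.0
    return -(length - (max_len - buffer)) / buffer
```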