Demystifying Reinforcement Learning in Agentic Reasoning

Abstract

The recent emergence of agentic RL has shown that RL can also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and best practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL: clip higher, overlong reward shaping, and maintaining adequate policy entropy all improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms both frequent tool calls and verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieve strong results on challenging benchmarks with smaller models, and establish a practical baseline for future agentic RL research. Beyond these empirical insights, we contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs on four challenging benchmarks: AIME2024, AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models can achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
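The abstract names the exploration-friendly techniques in insight (ii) without giving formulas. The sketch below illustrates them as they are commonly formulated in the RL-for-LLM literature (for example in DAPO): an asymmetric PPO-style clip range, a soft length penalty, and a per-token entropy measure. It is a minimal sketch under stated assumptions, not the authors' implementation. All function names, parameter names, and default values (`eps_low`, `eps_high`, `max_len`, `buffer_len`) are hypothetical and chosen for illustration only.

```python
import math


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style per-token objective with an asymmetric clip range.

    "Clip higher" is sketched here as eps_high > eps_low, so the
    probability of a positively rewarded token may grow further
    before clipping takes effect. Default values are assumptions.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def overlong_penalty(length, max_len, buffer_len):
    """Soft length penalty added to the task reward (assumed form).

    Zero up to (max_len - buffer_len), then decreasing linearly
    to -1 at max_len, and -1 for anything longer.
    """
    soft_limit = max_len - buffer_len
    if length <= soft_limit:
        return 0.0
    if length <= max_len:
        return (soft_limit - length) / buffer_len
    return -1.0


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

With a probability ratio of 1.25 and a positive advantage, a symmetric range of 0.2 would cap the ratio at 1.2, whereas the wider upper bound of 0.28 leaves it unclipped. Tracking the mean of `token_entropy` over sampled tokens is one simple way to monitor whether the policy entropy stays adequate during training.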