Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
March 23, 2026
Authors: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
cs.AI
Abstract
Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along five axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield seven key takeaways, e.g., (1) reward and algorithm choices are scale-dependent: smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards; (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance; and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
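To make the staged-vs-dense distinction in takeaway (1) concrete, here is a minimal sketch of the two reward schemes. The function names and the milestone/constraint-count interface are illustrative assumptions, not the paper's actual implementation: a dense reward grades every rollout by the fraction of constraints satisfied, while a staged reward gates constraint credit behind an easier milestone (a well-formed plan), giving smaller models a learnable signal before they can satisfy any constraints.

```python
# Hedged sketch: staged vs dense reward shaping for a plan-generation agent.
# The milestone ("format_ok") and constraint counts are hypothetical stand-ins
# for TravelPlanner's checks; this is not the paper's code.

def dense_reward(satisfied: int, total: int) -> float:
    """Dense reward: fraction of constraints the generated plan satisfies."""
    return satisfied / total if total else 0.0

def staged_reward(format_ok: bool, satisfied: int, total: int) -> float:
    """Staged reward: no credit until the plan clears an easier milestone
    (syntactic validity), then a bonus plus scaled constraint credit."""
    if not format_ok:
        return 0.0                    # stage 1 failed: zero signal
    return 0.5 + 0.5 * dense_reward(satisfied, total)
```

Under this shaping, a small model that only learns to emit well-formed plans already earns 0.5, whereas the dense scheme would score it near zero; a larger model that satisfies constraints quickly gains little from the extra stage, matching the scale-dependence the study reports.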