Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
March 23, 2026
Authors: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
cs.AI
Abstract
Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along five axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield seven key takeaways, e.g., (1) reward and algorithm choices are scale-dependent: smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards; (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance; and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
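To make the staged-vs-dense distinction in takeaway (1) concrete, here is a minimal sketch of the two reward schemes. The function names and the milestone/constraint-count interface are illustrative assumptions, not the paper's actual implementation: a dense reward grades every rollout by the fraction of constraints satisfied, while a staged reward gates constraint credit behind an easier milestone (a well-formed plan), giving smaller models a learnable signal before they can satisfy any constraints.

```python
# Hedged sketch: staged vs dense reward shaping for a plan-generation agent.
# The milestone ("format_ok") and constraint counts are hypothetical stand-ins
# for TravelPlanner's checks; this is not the paper's code.

def dense_reward(satisfied: int, total: int) -> float:
    """Dense reward: fraction of constraints the generated plan satisfies."""
    return satisfied / total if total else 0.0

def staged_reward(format_ok: bool, satisfied: int, total: int) -> float:
    """Staged reward: no credit until the plan clears an easier milestone
    (syntactic validity), then a bonus plus scaled constraint credit."""
    if not format_ok:
        return 0.0                    # stage 1 failed: zero signal
    return 0.5 + 0.5 * dense_reward(satisfied, total)
```

Under this shaping, a small model that only learns to emit well-formed plans already earns 0.5, whereas the dense scheme would score it near zero; a larger model that satisfies constraints quickly gains little from the extra stage, matching the scale-dependence the study reports.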