
Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

March 23, 2026
作者: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
cs.AI

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
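The staged-versus-dense reward contrast from takeaway (1) can be sketched as follows. This is a hypothetical illustration only: the function names, the two-stage split, and the 0.5 gating constant are assumptions for exposition, not the paper's exact reward formulation.

```python
def dense_reward(satisfied: int, total: int) -> float:
    """Dense reward: fraction of constraints satisfied.
    A simple signal like this suffices for larger models."""
    return satisfied / total

def staged_reward(satisfied: int, total: int, plan_is_valid: bool) -> float:
    """Staged reward: gate constraint credit behind producing a valid plan,
    giving smaller models an easier early learning signal.
    Stage 1: learn to emit a well-formed plan at all (reward 0 otherwise).
    Stage 2: once valid, reward rises with constraint satisfaction."""
    if not plan_is_valid:
        return 0.0
    return 0.5 + 0.5 * (satisfied / total)
```

For example, a model that satisfies 2 of 4 constraints earns 0.5 under the dense reward regardless of plan validity, but 0.0 under the staged reward if its plan is malformed, which concentrates early training on producing valid plans.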