
Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

March 23, 2026
作者: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
cs.AI

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
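The staged-versus-dense reward contrast from takeaway (1) can be sketched as follows. This is a hypothetical illustration only: the function names, the two-stage split, and the 0.5 gating constant are assumptions for exposition, not the paper's exact reward formulation.

```python
def dense_reward(satisfied: int, total: int) -> float:
    """Dense reward: fraction of constraints satisfied.
    A simple signal like this suffices for larger models."""
    return satisfied / total

def staged_reward(satisfied: int, total: int, plan_is_valid: bool) -> float:
    """Staged reward: gate constraint credit behind producing a valid plan,
    giving smaller models an easier early learning signal.
    Stage 1: learn to emit a well-formed plan at all (reward 0 otherwise).
    Stage 2: once valid, reward rises with constraint satisfaction."""
    if not plan_is_valid:
        return 0.0
    return 0.5 + 0.5 * (satisfied / total)
```

For example, a model that satisfies 2 of 4 constraints earns 0.5 under the dense reward regardless of plan validity, but 0.0 under the staged reward if its plan is malformed, which concentrates early training on producing valid plans.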