A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

October 1, 2025
Authors: Ruiyi Wang, Prithviraj Ammanabrolu
cs.AI

Abstract

We study what actually works and what doesn't for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking the design space down into three inter-related pillars -- environment, reward, and policy -- and then empirically deriving a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software-engineering-style tasks. (i) For the environment, we analyze the impact of task complexity in terms of the sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) For the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) versus unbiased (RLOO) policy gradient methods, in addition to showing how to find the optimal ratio of Supervised Fine-tuning (SFT) to RL training given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
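
To make the reward-sparsity ablation in (ii) concrete, below is a minimal Python sketch contrasting sparse episode-level credit with dense turn-level shaping for a toy text-game episode. The `Turn` fields, the subgoal bonus value, and the example actions are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One agent action in a multi-turn text-game episode (illustrative)."""
    action: str
    subgoal_hit: bool   # e.g. a TextWorld/ALFWorld subgoal such as "take key"
    task_done: bool     # terminal success signal

def sparse_rewards(episode: List[Turn]) -> List[float]:
    """Episode-level credit only: a single reward at the final turn on success."""
    last = len(episode) - 1
    return [1.0 if (i == last and t.task_done) else 0.0
            for i, t in enumerate(episode)]

def dense_rewards(episode: List[Turn], subgoal_bonus: float = 0.1) -> List[float]:
    """Turn-level shaping: a small bonus per subgoal plus the terminal reward."""
    return [subgoal_bonus * t.subgoal_hit + float(t.task_done) for t in episode]

episode = [Turn("open drawer", subgoal_hit=True, task_done=False),
           Turn("take key", subgoal_hit=True, task_done=False),
           Turn("unlock door with key", subgoal_hit=False, task_done=True)]
print(sparse_rewards(episode))  # [0.0, 0.0, 1.0]
print(dense_rewards(episode))   # [0.1, 0.1, 1.0]
```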
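
Similarly, for the biased-versus-unbiased contrast in (iii), here is a small sketch of GRPO-style group-normalized advantages next to RLOO-style leave-one-out baselines over a group of sampled rollout returns. The group size and return values are made up for illustration, and this covers only the advantage-estimation step, not a full training loop or the paper's exact formulation.

```python
import statistics
from typing import List

def grpo_advantages(returns: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style: standardize returns within the sampled group.
    Dividing by the group standard deviation is a commonly cited source of bias."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

def rloo_advantages(returns: List[float]) -> List[float]:
    """RLOO-style: baseline each sample with the mean of the *other* samples,
    which keeps the resulting policy-gradient estimate unbiased."""
    k, total = len(returns), sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]

group_returns = [1.0, 0.0, 0.0, 1.0]   # e.g. task success for 4 rollouts of one prompt
print(grpo_advantages(group_returns))  # approx [1.0, -1.0, -1.0, 1.0]
print(rloo_advantages(group_returns))  # [0.667, -0.667, -0.667, 0.667] (rounded)
```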