A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

Abstract

We study what actually works, and what does not, when training large language models (LLMs) as agents via multi-turn reinforcement learning (RL). Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three interrelated pillars (environment, reward, and policy) and then empirically deriving a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software-engineering-style tasks. (i) For the environment, we analyze the impact of task complexity in terms of the sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) For the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods, and we show how to find the optimal ratio of supervised fine-tuning (SFT) to RL training given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
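To make the distinction between the RLOO and GRPO estimators named in point (iii) concrete, the following is a minimal sketch of how each one typically turns a group of episode returns into per-rollout advantages. It is not taken from the paper's code: the function names, the `eps` constant, and the example returns are hypothetical and chosen only for illustration. RLOO baselines each rollout against the mean of the other rollouts, while GRPO subtracts the group mean and divides by the group standard deviation; that rescaling is often cited as one source of bias, though the paper may have additional reasons for its labels.

```python
from statistics import mean, pstdev


def rloo_advantages(returns):
    """Leave-one-out baseline: compare each return with the mean of the others."""
    k = len(returns)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per task")
    total = sum(returns)
    # (total - r) / (k - 1) is the mean return of the other k - 1 rollouts.
    return [r - (total - r) / (k - 1) for r in returns]


def grpo_advantages(returns, eps=1e-6):
    """Group-normalized advantage: center on the group mean, scale by the group std."""
    mu = mean(returns)
    sigma = pstdev(returns)
    # eps (an assumed constant) guards against division by zero when all returns are equal.
    return [(r - mu) / (sigma + eps) for r in returns]


if __name__ == "__main__":
    # Hypothetical sparse returns: one success out of four rollouts of the same task.
    returns = [1.0, 0.0, 0.0, 0.0]
    print("RLOO:", rloo_advantages(returns))
    print("GRPO:", grpo_advantages(returns))
```

With sparse success/failure returns like these, both estimators reward the single successful rollout, but GRPO's scaling depends on how spread out the group's returns are, which is one way the choice of algorithm can interact with reward sparsity.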
