Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Abstract

Large language model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly because the resulting agents cannot recover from errors. Step-level critique data would address this, but it is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose Agent-R, an iterative self-training framework that enables language agents to reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages Monte Carlo Tree Search (MCTS) to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection is that revision must be timely rather than deferred until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. The failed trajectory is then spliced at that step with the adjacent correct path, which shares the same parent node in the search tree. This strategy lets the model learn reflection based on its current policy and therefore yields better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
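The splicing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` type, the `judge` callable standing in for the actor model's error detection, and the wording of the revision signal are all hypothetical, and the two trajectories are assumed to share a common prefix ending at the same parent node.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    """One (action, observation) pair in a trajectory (hypothetical type)."""
    action: str
    observation: str


def shared_prefix_len(bad: List[Step], good: List[Step]) -> int:
    """Number of leading steps the two trajectories have in common."""
    n = 0
    for b, g in zip(bad, good):
        if b != g:
            break
        n += 1
    return n


def first_error_index(bad: List[Step], start: int,
                      judge: Callable[[List[Step]], bool]) -> Optional[int]:
    """Index of the first step, at or after `start`, that `judge` flags.

    `judge` is an assumed stand-in for the actor model: it receives the
    trajectory prefix ending at the candidate step and returns True if
    it considers that last step an error.
    """
    for i in range(start, len(bad)):
        if judge(bad[:i + 1]):
            return i
    return None


def splice(bad: List[Step], good: List[Step],
           judge: Callable[[List[Step]], bool],
           revision_signal: str) -> Optional[List[Step]]:
    """Build a revision trajectory: the failed path up to the first detected
    error, a reflection step, then the correct sibling path from the fork."""
    fork = shared_prefix_len(bad, good)
    err = first_error_index(bad, fork, judge)
    if err is None:
        return None  # the judge found no error it can recognize
    reflection = Step(action=revision_signal, observation="")
    return bad[:err + 1] + [reflection] + good[fork:]


# Toy example: both paths start with "look", then diverge.
bad_path = [Step("look", "room"), Step("open fridge", "empty"),
            Step("open fridge", "empty")]
good_path = [Step("look", "room"), Step("open cabinet", "mug"),
             Step("take mug", "done")]
SIGNAL = "I made a mistake earlier; let me try a different action."
revised = splice(bad_path, good_path,
                 lambda prefix: prefix[-1].action == "open fridge", SIGNAL)
```

In this toy run the judge flags the first `open fridge` step, so `revised` keeps only that one erroneous step before the reflection and the correct continuation. Cutting at the first detected error, rather than at the end of the failed rollout, is what makes the revision timely.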
