Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Abstract

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, both of which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that uses a single LLM as both the agent and the environment at the same time, enabling bootstrapped co-evolution. Role-Agent comprises two complementary components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts the future state after each action; the alignment between the predicted and actual states is then used as a process reward, which encourages environment-aware reasoning. In AIW, the LLM analyzes failure modes in failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, with an average gain of more than 4% over strong baselines.
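The two mechanisms described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the abstract does not say how alignment between predicted and actual states is scored, so token-overlap F1 is used as a stand-in, and the LLM-based failure analysis and retrieval in AIW is replaced by simple failure-mode labels attached to each task. The names `token_f1`, `process_rewards`, `reweight_tasks`, and the `boost` parameter are hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, actual: str) -> float:
    """Token-overlap F1 between a predicted and an observed state description.

    Assumption: a stand-in alignment score; the paper's actual measure is not
    given in the abstract.
    """
    pred, act = predicted.lower().split(), actual.lower().split()
    if not pred or not act:
        return 0.0
    overlap = sum((Counter(pred) & Counter(act)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(act)
    return 2 * precision * recall / (precision + recall)


def process_rewards(predicted_states, actual_states):
    """WIA-style sketch: one reward per step from prediction/observation alignment."""
    return [token_f1(p, a) for p, a in zip(predicted_states, actual_states)]


def reweight_tasks(task_failure_modes, observed_failures, boost=2.0):
    """AIW-style sketch: upweight tasks that share failure modes with failed runs.

    task_failure_modes: dict mapping task id -> list of failure-mode labels
        (hypothetical labels; in the paper the LLM performs this analysis).
    observed_failures: list of failure-mode labels extracted from failed trajectories.
    Returns a normalized sampling distribution over tasks.
    """
    counts = Counter(observed_failures)
    weights = {
        task: 1.0 + boost * sum(counts[mode] for mode in modes)
        for task, modes in task_failure_modes.items()
    }
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


if __name__ == "__main__":
    rewards = process_rewards(
        ["the drawer is open", "the mug is on the desk"],
        ["the drawer is open", "the mug is in the sink"],
    )
    print(rewards)

    sampling = reweight_tasks(
        {"t1": ["loop"], "t2": ["wrong_tool"], "t3": []},
        observed_failures=["loop", "loop"],
    )
    print(sampling)
```

In this sketch, tasks with no matching failure mode keep a base weight of 1.0, so they are still sampled and the training distribution is reshaped rather than replaced.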