Multi-Agent Tool-Integrated Policy Optimization

Abstract

Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically use a single agent, which suffers from limited context length and noisy tool responses. A natural solution is a multi-agent framework in which planner and worker agents manage context. However, no existing method supports effective reinforcement learning (RL) post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which uses role-specific prompts to train distinct roles (planner and worker) within a single LLM instance via RL. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design removes the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines, with an average relative improvement of 18.38%, and is more robust to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.
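
The planner-worker arrangement summarized above can be pictured with a minimal sketch. It is not the authors' implementation: the prompt wording, the `Rollout` and `Trajectory` types, the `run_episode` loop, the `FINAL:` convention, and the stub model and tool are all hypothetical names introduced here for illustration. The sketch only shows one model callable serving both roles through role-specific prompts, with raw tool output confined to the worker's context. It does not attempt to reproduce the credit assignment mechanism or any RL update.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical role prompts; the source does not give the actual wording.
PLANNER_PROMPT = "You are the planner. Delegate subtasks, then answer with 'FINAL: ...'."
WORKER_PROMPT = "You are the worker. Solve the subtask with tools and report briefly."

# A model is any callable mapping (system_prompt, context) to generated text.
Model = Callable[[str, str], str]


@dataclass
class Rollout:
    role: str     # "planner" or "worker"
    context: str  # what the model saw for this generation
    output: str   # what the model generated


@dataclass
class Trajectory:
    rollouts: List[Rollout] = field(default_factory=list)
    answer: str = ""


def run_episode(model: Model, tool: Callable[[str], str], task: str,
                max_turns: int = 3) -> Trajectory:
    """Run one episode in which the same model plays both roles."""
    traj = Trajectory()
    planner_context = f"Task: {task}"
    for _ in range(max_turns):
        plan = model(PLANNER_PROMPT, planner_context)
        traj.rollouts.append(Rollout("planner", planner_context, plan))
        if plan.startswith("FINAL:"):
            traj.answer = plan[len("FINAL:"):].strip()
            break
        # The worker gets a fresh context holding the raw (possibly noisy) tool output.
        worker_context = f"Subtask: {plan}\nTool output: {tool(plan)}"
        summary = model(WORKER_PROMPT, worker_context)
        traj.rollouts.append(Rollout("worker", worker_context, summary))
        # Only the worker's short summary flows back into the planner's context.
        planner_context += f"\nSubtask: {plan}\nWorker result: {summary}"
    return traj


# Stub model and tool standing in for a real LLM and a real search tool.
def stub_model(system_prompt: str, context: str) -> str:
    if system_prompt == WORKER_PROMPT:
        return "42"
    if "Worker result" in context:
        return "FINAL: 42"
    return "look up the value"


def stub_tool(query: str) -> str:
    return "<html>noisy page ... the value is 42 ...</html>"


traj = run_episode(stub_model, stub_tool, "What is the value?")
print([r.role for r in traj.rollouts], traj.answer)
```

In this sketch the planner's context grows only by the worker's summaries, which is one way to read the abstract's point about managing context and limiting exposure to noisy tool responses. An RL trainer would then have to assign reward to both the planner and worker rollouts collected in `traj.rollouts`, and how MATPO does this is not specified in the abstract.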