Multi-Agent Tool-Integrated Policy Optimization

Abstract

Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, which suffers from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework in which planner and worker agents manage context. However, no existing method supports effective reinforcement learning (RL) post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which uses reinforcement learning to train distinct roles (planner and worker) within a single LLM instance through role-specific prompts. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need for memory-intensive multi-LLM deployment while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines, with an average relative improvement of 18.38%, and is more robust to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.
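
To make the single-model, multi-role idea concrete, the following is a minimal sketch and not the MATPO implementation. It assumes a hypothetical `policy(system_prompt, user_message)` callable standing in for one LLM, hypothetical prompt strings, a plan format of semicolon-separated subtasks, and the simplest possible credit assignment: the reward of the planner's final answer is copied to every planner and worker rollout of the episode. The source states only that credit is assigned across planner and worker rollouts, so that last choice is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical role prompts; the source does not give their wording.
PLANNER_PROMPT = "You are the planner. Split the task into subtasks."
WORKER_PROMPT = "You are the worker. Solve the subtask using tools."

# One callable plays both roles: (system_prompt, user_message) -> text.
Policy = Callable[[str, str], str]


@dataclass
class Rollout:
    role: str            # "planner" or "worker"
    user_message: str    # what the role was asked
    output: str          # what the shared policy produced
    reward: float = 0.0  # filled in after the episode ends


def run_episode(policy: Policy, task: str,
                reward_fn: Callable[[str], float]) -> List[Rollout]:
    """Run one planner/worker episode with a single shared policy."""
    # Planner turn: produce subtasks (assumed format: separated by ';').
    plan = policy(PLANNER_PROMPT, task)
    rollouts = [Rollout("planner", task, plan)]

    # Worker turns: same policy, different role prompt, fresh context.
    results = []
    for subtask in (s.strip() for s in plan.split(";")):
        if not subtask:
            continue
        result = policy(WORKER_PROMPT, subtask)
        rollouts.append(Rollout("worker", subtask, result))
        results.append(result)

    # Planner turn: aggregate only the worker results, not raw tool output.
    summary = task + "\nWorker results: " + " | ".join(results)
    final_answer = policy(PLANNER_PROMPT, summary)
    rollouts.append(Rollout("planner", summary, final_answer))

    # Assumed credit assignment: share the final reward with all rollouts.
    reward = reward_fn(final_answer)
    for rollout in rollouts:
        rollout.reward = reward
    return rollouts


def toy_policy(system_prompt: str, user_message: str) -> str:
    """Deterministic stand-in for an LLM, for illustration only."""
    if system_prompt == WORKER_PROMPT:
        return f"result({user_message})"
    if "Worker results:" in user_message:
        return "final answer"
    return "look up A; look up B"
```

In an actual RL setup, the rewarded rollouts from both roles would feed one policy-gradient update of the shared model; here they are only collected, so the sketch shows the data flow rather than the optimization.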