MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Abstract

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
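To make the four coordination topologies named above concrete, the following is a minimal sketch that builds each one as a list of communication links between agents. It is an illustration only, not the MARBLE implementation: the function names, the agent labels, the choice of the first agent as the star hub, the binary branching of the tree, and the reading of "graph" as a fully connected mesh are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical helpers: each returns a list of (sender, receiver) links.
# None of these names come from the MARBLE codebase.

def star(agents):
    """One hub (assumed to be the first agent) linked to every other agent."""
    hub, *others = agents
    return [(hub, a) for a in others]

def chain(agents):
    """Each agent is linked only to the next agent in sequence."""
    return list(zip(agents, agents[1:]))

def tree(agents, branching=2):
    """Agents laid out level by level; agent i reports to agent (i - 1) // branching."""
    return [(agents[(i - 1) // branching], agents[i]) for i in range(1, len(agents))]

def graph(agents):
    """Assumed here to be a fully connected mesh: every pair of agents is linked."""
    return list(combinations(agents, 2))

if __name__ == "__main__":
    agents = ["a0", "a1", "a2", "a3"]
    for build in (star, chain, tree, graph):
        print(build.__name__, build(agents))
```

With four agents, the star, chain, and tree layouts each have three links, while the fully connected mesh has six, which shows how much more direct communication the graph layout permits under these assumptions.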
