
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

March 3, 2025
作者: Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
cs.AI

Abstract

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
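
The milestone-based evaluation idea can be illustrated with a small sketch. The snippet below is not taken from the MARBLE codebase; the `Milestone` class, the `milestone_achievement_rate` helper, and the example milestones are hypothetical, meant only to show how task progress might be decomposed into discrete, checkable milestones and aggregated into a single KPI alongside raw task completion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of milestone-based scoring; not the actual MARBLE API.

@dataclass
class Milestone:
    """A discrete, checkable unit of progress toward a multi-agent task."""
    name: str
    check: Callable[[Dict], bool]  # predicate over the shared task state
    weight: float = 1.0


def milestone_achievement_rate(milestones: List[Milestone], state: Dict) -> float:
    """Weighted fraction of milestones satisfied by the final task state."""
    total = sum(m.weight for m in milestones)
    achieved = sum(m.weight for m in milestones if m.check(state))
    return achieved / total if total > 0 else 0.0


if __name__ == "__main__":
    # Toy research-scenario milestones (illustrative only).
    milestones = [
        Milestone("literature_collected", lambda s: len(s.get("papers", [])) >= 5),
        Milestone("idea_proposed", lambda s: bool(s.get("proposal"))),
        Milestone("experiment_planned", lambda s: bool(s.get("plan")), weight=2.0),
    ]
    state = {"papers": ["p1", "p2", "p3", "p4", "p5"], "proposal": "..."}
    print(f"Milestone achievement rate: {milestone_achievement_rate(milestones, state):.2f}")
```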
