Benchmarking LLMs' Swarm Intelligence
May 7, 2025
Authors: Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
cs.AI
Abstract
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints, such as the limited local perception and communication characteristic of natural swarms, remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (a k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit, built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
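
The abstract's central constraint, each agent perceiving only a k x k window of a shared 2D grid, can be made concrete with a short sketch. The code below is a hypothetical illustration, not code from the SwarmBench repository; the function name local_view, the odd-k convention, and the out-of-bounds padding value are all assumptions.

import numpy as np

def local_view(grid, pos, k=5, pad_value=-1):
    # Return the k x k egocentric window centred on the agent at `pos`.
    # Cells outside the grid are filled with `pad_value`, so agents near a
    # boundary still receive a fixed-size observation.
    assert k % 2 == 1, "k must be odd so the agent sits at the centre"
    r = k // 2
    padded = np.pad(grid, r, constant_values=pad_value)
    row, col = pos[0] + r, pos[1] + r  # shift indices into the padded frame
    return padded[row - r : row + r + 1, col - r : col + r + 1]

# Example: a 6x6 world with one obstacle; an agent at (0, 1) sees padding above.
world = np.zeros((6, 6), dtype=int)
world[2, 3] = 1  # obstacle
print(local_view(world, (0, 1), k=3))

An observation like this, serialized into the prompt together with any messages from nearby agents, is all the state a decentralized LLM agent would receive at each step.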
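Similarly, the zero-shot decision step the abstract describes, where each agent must act from its partial view plus locally exchanged messages, might look like the sketch below. The prompt wording, the action space, and the llm callable are placeholders, not the benchmark's actual protocol.

def agent_step(llm, obs, inbox):
    # Serialize the local observation and neighbour messages into a prompt;
    # the agent acts from this partial view alone (zero-shot, no examples).
    prompt = (
        "You are one agent in a swarm on a 2D grid.\n"
        f"Your {obs.shape[0]}x{obs.shape[1]} local view (-1 = out of bounds):\n"
        f"{obs}\n"
        f"Messages from nearby agents: {inbox}\n"
        "Reply with one move (up/down/left/right/stay) and an optional "
        "short message to broadcast to nearby agents."
    )
    return llm(prompt)  # e.g. one chat-completion call to any LLM API

# Example, reusing `world` and `local_view` from the previous sketch and a
# trivial stand-in "model" that always answers "right":
print(agent_step(lambda p: "right", local_view(world, (0, 1), k=3), inbox=[]))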