
Benchmarking LLMs' Swarm Intelligence

May 7, 2025
作者: Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
cs.AI

Abstract

Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints, such as the limited local perception and communication characteristic of natural swarms, remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit, built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
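For intuition, here is a minimal sketch of the kind of k x k egocentric observation the abstract describes: each agent sees only a small window of the 2D grid centered on its own position, with out-of-grid cells masked. The function and parameter names below are illustrative assumptions, not SwarmBench's actual API.

```python
def local_view(grid, pos, k, fill=-1):
    """Return the k x k window of `grid` centered on the agent at `pos`.

    Cells beyond the grid border are filled with `fill`, mimicking the
    limited local perception imposed on each decentralized agent.
    (Illustrative sketch only; not the SwarmBench implementation.)
    """
    assert k % 2 == 1, "k must be odd so the agent sits at the window's center"
    r = k // 2
    y0, x0 = pos
    rows, cols = len(grid), len(grid[0])
    view = []
    for dy in range(-r, r + 1):
        row = []
        for dx in range(-r, r + 1):
            y, x = y0 + dy, x0 + dx
            in_bounds = 0 <= y < rows and 0 <= x < cols
            row.append(grid[y][x] if in_bounds else fill)
        view.append(row)
    return view

# Example: an agent in the corner of a 5x5 grid with one obstacle (1).
grid = [[0] * 5 for _ in range(5)]
grid[1][1] = 1
print(local_view(grid, (0, 0), k=3))
```

Because the agent sits in the corner, the top row and left column of its 3 x 3 view are masked with the fill value, and the obstacle at (1, 1) appears in the bottom-right cell; this kind of partial, egocentric observation is what forces agents to coordinate through local communication rather than global state.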

