Benchmarking LLMs' Swarm Intelligence

Abstract

Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) under strict constraints, such as the limited local perception and communication characteristic of natural swarms, remains largely unexplored, particularly with respect to the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (a k × k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, the results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit, built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets we generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
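To make the local-perception constraint concrete, the following is a minimal Python sketch of how a k × k view centred on an agent could be cut out of a 2D grid. It is an illustration under stated assumptions, not SwarmBench's actual implementation: the function name `local_view`, the padding symbol `WALL`, the cell symbols, and the requirement that k be odd are all hypothetical choices made for this example.

```python
from typing import List

WALL = "#"  # hypothetical symbol shown for cells that fall outside the grid


def local_view(grid: List[List[str]], row: int, col: int, k: int) -> List[List[str]]:
    """Return the k x k window of `grid` centred on (row, col).

    Assumes k is a positive odd integer so the agent sits exactly in the centre.
    Cells beyond the grid boundary are filled with WALL.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    radius = k // 2
    height, width = len(grid), len(grid[0])
    view = []
    for dr in range(-radius, radius + 1):
        line = []
        for dc in range(-radius, radius + 1):
            y, x = row + dr, col + dc
            inside = 0 <= y < height and 0 <= x < width
            line.append(grid[y][x] if inside else WALL)
        view.append(line)
    return view


# A 4 x 4 toy grid with two agents, "A" and "B" (symbols are illustrative only).
grid = [list(r) for r in ["....", ".A..", "..B.", "...."]]
centre_view = local_view(grid, 1, 1, 3)   # what agent "A" would observe
corner_view = local_view(grid, 0, 0, 3)   # a view that extends past the boundary
```

In this sketch agent "A" sees agent "B" only because "B" happens to lie inside its 3 × 3 window. Anything farther away is invisible to it, which is the kind of incomplete spatial information the abstract describes.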
