SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?
March 16, 2025
Authors: Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath
cs.AI
Abstract
Reasoning and strategic behavior in social interactions are hallmarks
of intelligence. This form of reasoning is significantly more sophisticated
than isolated planning or reasoning tasks in static settings (e.g., math
problem solving). In this paper, we present Strategic Planning,
Interaction, and Negotiation (SPIN-Bench), a new multi-domain
evaluation designed to measure the intelligence of strategic planning
and social reasoning. While many existing benchmarks focus on narrow
planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks,
competitive board games, cooperative card games, and multi-agent negotiation
scenarios in one unified framework. The framework includes both a benchmark and
an arena to simulate and evaluate a variety of social settings that
test the reasoning and strategic behavior of AI agents. We formulate the benchmark
SPIN-Bench by systematically varying action spaces, state complexity, and the
number of interacting agents to simulate a variety of social settings where
success depends not only on methodical, step-wise decision making, but also on
conceptual inference about other (adversarial or cooperative) participants.
Our experiments reveal that while contemporary LLMs handle basic fact
retrieval and short-range planning reasonably well, they encounter
significant performance bottlenecks in tasks requiring deep multi-hop
reasoning over large state spaces and socially adept coordination under
uncertainty. We envision SPIN-Bench as a catalyst for future research on robust
multi-agent planning, social reasoning, and human-AI teaming.