
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

March 16, 2025
Authors: Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath
cs.AI

Abstract

Reasoning and strategic behavior in social interactions are hallmarks of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a new multi-domain evaluation designed to measure strategic planning and social reasoning abilities. While many existing benchmarks focus on narrow planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios in one unified framework. The framework includes both a benchmark and an arena for simulating and evaluating a variety of social settings that test the reasoning and strategic behavior of AI agents. We construct the benchmark by systematically varying action spaces, state complexity, and the number of interacting agents, simulating a variety of social settings where success depends not only on methodical, step-wise decision making but also on conceptual inference about other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human-AI teaming.
