InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
August 22, 2025
Authors: Zizhen Li, Chuanhao Li, Yibin Wang, Qi Chen, Diping Song, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Kaipeng Zhang
cs.AI
Abstract
Large language models (LLMs) have shown strong performance on human-centric reasoning tasks. While
previous evaluations have explored whether LLMs can infer intentions or detect
deception, they often overlook the individualized reasoning styles that
influence how people interpret and act in social contexts. Social deduction
games (SDGs) provide a natural testbed for evaluating individualized reasoning
styles, where different players may adopt diverse but contextually valid
reasoning strategies under identical conditions. To address this, we introduce
InMind, a cognitively grounded evaluation framework designed to assess whether
LLMs can capture and apply personalized reasoning styles in SDGs. InMind
enhances structured gameplay data with round-level strategy traces and
post-game reflections, collected under both Observer and Participant modes. It
supports four cognitively motivated tasks that jointly evaluate both static
alignment and dynamic adaptation. As a case study, we apply InMind to the game
Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o,
frequently rely on lexical cues, struggling to anchor reflections in temporal
gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs
like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These
findings reveal key limitations in current LLMs' capacity for individualized,
adaptive reasoning, and position InMind as a step toward cognitively aligned
human-AI interaction.
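
For concreteness, the sketch below shows one plausible way the enhanced gameplay data described in the abstract (round-level strategy traces, post-game reflections, Observer vs. Participant modes) could be represented. The field names and structure are illustrative assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record format for InMind-style annotated gameplay data.
# All names here are illustrative assumptions, not the released dataset schema.

@dataclass
class RoundTrace:
    round_id: int
    public_actions: List[str]   # observable moves/votes in this round
    strategy_note: str          # the player's round-level strategy trace

@dataclass
class AnnotatedGame:
    mode: str                   # "observer" or "participant"
    player_id: str
    rounds: List[RoundTrace] = field(default_factory=list)
    post_game_reflection: str = ""  # free-form reflection recorded after the game

# Example: a single-round record collected in Participant mode.
game = AnnotatedGame(
    mode="participant",
    player_id="P3",
    rounds=[
        RoundTrace(
            round_id=1,
            public_actions=["P3 votes approve", "quest succeeds"],
            strategy_note="Stay quiet early; watch P5's voting pattern.",
        )
    ],
    post_game_reflection="Over-trusted early approvals and missed P5's bluff.",
)
```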