
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles

August 22, 2025
作者: Zizhen Li, Chuanhao Li, Yibin Wang, Qi Chen, Diping Song, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Kaipeng Zhang
cs.AI

Abstract

LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs' capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human-AI interaction.
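To make the data setup concrete, here is a minimal Python sketch of what gameplay records augmented with round-level strategy traces and post-game reflections might look like. The class and field names are illustrative assumptions for this sketch only; the abstract does not specify InMind's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for InMind-style annotated gameplay data.
# All names here are assumptions for illustration, not the paper's format.

@dataclass
class RoundTrace:
    round_id: int
    actions: List[str]      # observable in-game actions this round
    strategy_note: str      # the player's round-level strategy trace

@dataclass
class GameRecord:
    game: str               # e.g., "Avalon"
    mode: str               # "observer" or "participant"
    player_id: str
    rounds: List[RoundTrace] = field(default_factory=list)
    post_game_reflection: str = ""  # free-text reflection collected after the game

# Example: one annotated round from a hypothetical Avalon session.
record = GameRecord(
    game="Avalon",
    mode="participant",
    player_id="P3",
    rounds=[RoundTrace(
        round_id=1,
        actions=["approve team", "vote mission success"],
        strategy_note="Stay quiet early; watch who pushes hardest for Merlin reads.",
    )],
    post_game_reflection="I misread P5's aggression as evil; weight voting patterns more next time.",
)
print(record.mode, len(record.rounds))
```

Pairing each round's actions with a free-text strategy note, plus a game-level reflection, is what would let the framework's tasks test both static alignment (matching a fixed reasoning style) and dynamic adaptation (tracking how that style evolves across rounds).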