Cognitive Foundations for Reasoning and Their Manifestation in LLMs
November 20, 2025
Authors: Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov
cs.AI
Abstract
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K reasoning traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. A meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglects the meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful reasoning structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.