Cognitive Foundations for Reasoning and Their Manifestation in LLMs
November 20, 2025
Authors: Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov
cs.AI

Human reasoning is built on multiple cognitive foundations, chiefly the capacity limits of working memory, pattern-recognition mechanisms, the structured storage of prior knowledge, and meta-cognitive monitoring. Working memory provides a temporary workspace for processing information during reasoning, and its capacity limit directly constrains how far a logical chain can be extended; pattern recognition lets humans rapidly invoke familiar reasoning schemas; the knowledge networks in long-term memory support analogical and inductive reasoning; and meta-cognition monitors whether the reasoning process remains sound.

Large language models approximate working memory through their attention mechanisms, with context-window length corresponding to information-retention capacity. The parametric knowledge formed from massive pretraining data can be viewed as a structured knowledge base that enables knowledge-driven reasoning. Current models, however, remain limited at the meta-cognitive level: they lack explicit monitoring of their own reasoning paths and struggle to reach human-level error detection and correction. This gap underscores the central role of the symbol-grounding problem in high-level reasoning and points toward future models that integrate explicit reasoning engines with implicit knowledge representations.
Abstract
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. A meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglects the meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
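To make the idea of test-time reasoning guidance concrete, the Python sketch below wraps a raw problem statement with a scaffold that prompts a model to decompose the problem, consider multiple representations, and self-monitor before committing to an answer. The scaffold wording, the build_guided_prompt function, and the example problem are illustrative assumptions, not the authors' released implementation.

# A minimal, hypothetical sketch of test-time reasoning guidance: prepend a
# scaffold that asks the model to instantiate cognitive elements (decomposition,
# diverse representations, self-monitoring) before it answers. The prompt
# wording and function name are assumptions for illustration only.

SCAFFOLD = (
    "Before answering, work through these steps:\n"
    "1. Restate the problem and identify what kind of problem it is.\n"
    "2. Decompose it into sub-goals.\n"
    "3. Consider at least two different representations (e.g., a diagram, an equation, an analogy).\n"
    "4. Solve each sub-goal, checking intermediate results against the stated constraints.\n"
    "5. Before finalizing, verify the answer satisfies every condition; if not, revise.\n"
)

def build_guided_prompt(problem: str) -> str:
    # Wrap a raw problem statement with the reasoning scaffold.
    return f"{SCAFFOLD}\nProblem:\n{problem}\n\nGuided reasoning:"

if __name__ == "__main__":
    print(build_guided_prompt(
        "A bat and a ball cost $1.10 in total; the bat costs $1.00 more "
        "than the ball. How much does the ball cost?"
    ))

The scaffolded prompt would then be passed to the model in place of the bare problem; the point of the design is that the structure correlated with successful traces (decomposition, diverse representation, self-monitoring) is supplied externally rather than left for the model to invoke spontaneously.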