Large Language Model Reasoning Failures

Abstract

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that divides reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning and guides future research towards stronger, more reliable, and more robust reasoning capabilities. We also release a comprehensive collection of research works on LLM reasoning failures as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.