

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

June 10, 2025
Authors: C. Opus, A. Lawsen
cs.AI

Abstract

Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
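
The first point, that the reported Tower of Hanoi failure points coincide with output-token limits, follows from simple arithmetic: the minimal solution for N disks has 2^N − 1 moves, so an exhaustive move list grows exponentially. The sketch below makes this concrete; the tokens-per-move cost and the 64,000-token output budget are illustrative assumptions, not figures taken from either paper.

```python
# Back-of-the-envelope check of when an exhaustive Tower of Hanoi move list
# outgrows a model's output budget. TOKENS_PER_MOVE and OUTPUT_BUDGET are
# assumed values for illustration only.
TOKENS_PER_MOVE = 10       # rough cost of one "move disk d from peg X to peg Y" line
OUTPUT_BUDGET = 64_000     # assumed maximum output tokens

for n in range(8, 16):
    moves = 2 ** n - 1                     # minimal number of moves for n disks
    est_tokens = moves * TOKENS_PER_MOVE
    status = "exceeds budget" if est_tokens > OUTPUT_BUDGET else "fits"
    print(f"N={n:2d}  moves={moves:6d}  ~{est_tokens:7d} tokens  {status}")
```

Under these assumptions the exhaustive listing overruns the budget when N reaches the low teens, i.e. before any question of reasoning ability arises.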
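The third point, that River Crossing instances with more than five actor/agent pairs and a boat capacity of three are unsolvable, can be checked mechanically by exhaustive search over bank configurations. The breadth-first sketch below is one way to do so; it assumes the usual formulation in which no actor may be with another pair's agent unless their own agent is present, with the constraint applied to both banks and to the boat.

```python
from itertools import combinations
from collections import deque

def river_crossing_solvable(n_pairs, boat_capacity):
    """BFS over bank configurations: is the actor/agent puzzle solvable?"""
    people = [(i, r) for i in range(n_pairs) for r in ("actor", "agent")]
    everyone = frozenset(people)

    def safe(group):
        # An actor may not be with another pair's agent unless their own agent is present.
        agents = {i for i, r in group if r == "agent"}
        for i, r in group:
            if r == "actor" and agents - {i} and i not in agents:
                return False
        return True

    start = (everyone, "left")        # everyone on the left bank, boat on the left
    seen = {start}
    queue = deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                  # everyone has crossed
            return True
        side = left if boat == "left" else everyone - left
        for k in range(1, boat_capacity + 1):
            for passengers in combinations(side, k):
                p = frozenset(passengers)
                if not safe(p):
                    continue
                new_left = left - p if boat == "left" else left | p
                if safe(new_left) and safe(everyone - new_left):
                    state = (new_left, "right" if boat == "left" else "left")
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False
```

With these rules, `river_crossing_solvable(5, 3)` reports a solution while `river_crossing_solvable(6, 3)` does not, matching the classical jealous-couples result the comment relies on.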
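Finally, the control condition, asking for a program that generates the solution rather than an exhaustive move list, amounts to requesting something like the standard recursive Tower of Hanoi procedure. The snippet below is an illustrative example of such a solution-generating function, not the exact prompt or language used by the authors.

```python
def hanoi(n, source="A", target="C", auxiliary="B"):
    """Yield the 2**n - 1 moves of the minimal Tower of Hanoi solution.

    Emitting this short function is constant-size output for a model,
    whereas listing every move grows exponentially with n.
    """
    if n == 0:
        return
    yield from hanoi(n - 1, source, auxiliary, target)   # clear the top n-1 disks
    yield (n, source, target)                            # move the largest disk
    yield from hanoi(n - 1, auxiliary, target, source)   # rebuild on top of it

# Example: list(hanoi(3)) returns the 7 moves for three disks.
```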