Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
June 10, 2025
Authors: C. Opus, A. Lawsen
cs.AI
Abstract
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit
"accuracy collapse" on planning puzzles beyond certain complexity thresholds.
We demonstrate that their findings primarily reflect experimental design
limitations rather than fundamental reasoning failures. Our analysis reveals
three critical issues: (1) Tower of Hanoi experiments systematically exceed
model output token limits at reported failure points, with models explicitly
acknowledging these constraints in their outputs; (2) The authors' automated
evaluation framework fails to distinguish between reasoning failures and
practical constraints, leading to misclassification of model capabilities; (3)
Most concerningly, their River Crossing benchmarks include mathematically
impossible instances for N > 5 due to insufficient boat capacity, yet models
are scored as failures for not solving these unsolvable problems. When we
control for these experimental artifacts by requesting generating functions
instead of exhaustive move lists, preliminary experiments across multiple
models indicate high accuracy on Tower of Hanoi instances previously reported
as complete failures. These findings highlight the importance of careful
experimental design when evaluating AI reasoning capabilities.
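To make the token-budget argument concrete: a Tower of Hanoi instance with N disks requires 2^N - 1 moves, so an exhaustive move list for N = 15 already contains 32,767 moves and, at even a handful of tokens per move, runs well past typical model output limits, whereas a solution-generating function has constant size. The following Python sketch is an illustration of the kind of compact generator the comment refers to, not the authors' actual prompt or evaluation code.

def hanoi_moves(n, source=0, target=2, auxiliary=1):
    # Yield the 2**n - 1 moves that solve an n-disk Tower of Hanoi,
    # each as a (disk, from_peg, to_peg) tuple. Hypothetical sketch,
    # not the authors' evaluation code.
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, auxiliary, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, auxiliary, target, source)

# Enumerating every move verbatim is what exhausts the output-token
# budget; the function above describes the complete solution in a few
# lines regardless of N.
moves = list(hanoi_moves(15))
assert len(moves) == 2**15 - 1  # 32,767 moves for N = 15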