The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
June 7, 2025
Authors: Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
cs.AI
Abstract
Recent generations of language models have introduced Large Reasoning Models
(LRMs) that generate detailed thinking processes before providing answers.
While these models demonstrate improved performance on reasoning benchmarks,
their fundamental capabilities, scaling properties, and limitations remain
insufficiently understood. Current evaluations primarily focus on established
math and coding benchmarks, emphasizing final answer accuracy. However, this
evaluation paradigm often suffers from contamination and does not provide
insights into the reasoning traces. In this work, we systematically investigate
these gaps with the help of controllable puzzle environments that allow precise
manipulation of complexity while maintaining consistent logical structures.
This setup enables the analysis of not only final answers but also the internal
reasoning traces, offering insights into how LRMs think. Through extensive
experiments, we show that LRMs face a complete accuracy collapse beyond certain
complexities. Moreover, they exhibit a counterintuitive scaling limit: their
reasoning effort increases with problem complexity up to a point, then declines
despite having token budget remaining. By comparing LRMs with their standard
LLM counterparts under the same inference compute, we identify three performance
regimes: (1) low-complexity tasks where standard models outperform LRMs, (2)
medium-complexity tasks where LRMs demonstrate an advantage, and (3)
high-complexity tasks where both models face complete collapse. We find that
LRMs have limitations in exact computation: they fail to use explicit
algorithms and reason inconsistently across scales. We also investigate the
reasoning traces in more depth, studying the patterns of explored solutions and
analyzing the models' computational behavior, shedding light on their
strengths and limitations and raising questions about their reasoning
capabilities.
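
To make the notion of a controllable puzzle environment concrete, here is a minimal sketch assuming a Tower of Hanoi setup in which the number of disks serves as the complexity knob while the logical structure of the task stays fixed; the HanoiEnv class, its interface, and the optimal_moves helper are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (assumed, not from the paper): a Tower of Hanoi
# environment whose difficulty is controlled by a single parameter, n_disks,
# while the rules stay identical across complexity levels.

class HanoiEnv:
    def __init__(self, n_disks: int):
        self.n = n_disks
        # Pegs 0, 1, 2; disks are integers, larger number = larger disk.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def is_legal(self, src: int, dst: int) -> bool:
        # A move is legal if the source peg is non-empty and the moved disk
        # is smaller than the current top disk of the destination peg.
        if not self.pegs[src]:
            return False
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def apply(self, src: int, dst: int) -> bool:
        # Apply a move if legal; return whether it was accepted. Checking a
        # model's move sequence step by step allows grading intermediate
        # reasoning, not only the final answer.
        if not self.is_legal(src, dst):
            return False
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def is_solved(self) -> bool:
        return len(self.pegs[2]) == self.n


def optimal_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2):
    # Ground-truth solution: the classic recursive algorithm, 2^n - 1 moves.
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n - 1, aux, src, dst))


if __name__ == "__main__":
    env = HanoiEnv(n_disks=4)          # increase n_disks to raise complexity
    for move in optimal_moves(4):
        assert env.apply(*move)
    print("solved:", env.is_solved())  # -> solved: True
```

Because the optimal solution length grows as 2^n - 1 in the number of disks, accuracy and per-move validity can be measured as a clean function of complexity, which is what makes this kind of environment suitable for probing both final answers and intermediate reasoning traces.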