The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

June 7, 2025
Authors: Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
cs.AI

Abstract

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final-answer accuracy. However, this evaluation paradigm often suffers from contamination and provides no insight into the reasoning traces themselves. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables analysis not only of final answers but also of the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite an ample remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks, where standard models outperform LRMs; (2) medium-complexity tasks, where LRMs demonstrate an advantage; and (3) high-complexity tasks, where both model types face complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.
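To make the evaluation setup concrete, here is a minimal sketch (not from the paper's codebase; all class and function names are illustrative) of a controllable puzzle environment in the spirit the abstract describes: a Tower of Hanoi instance whose single parameter, the number of disks, precisely controls complexity (the optimal solution has 2^n - 1 moves) while the logical rules stay fixed, together with a trace validator that checks every intermediate move rather than grading only the final answer.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs indexed 0..2


class TowerOfHanoi:
    """Puzzle environment whose difficulty is set by a single knob, n_disks."""

    def __init__(self, n_disks: int):
        # Complexity knob: the optimal solution length grows as 2**n - 1,
        # while the rules of the puzzle remain identical at every size.
        self.n = n_disks
        self.pegs: List[List[int]] = [list(range(n_disks, 0, -1)), [], []]

    def apply(self, move: Move) -> bool:
        """Apply one move; return False if it is illegal."""
        src, dst = move
        if not self.pegs[src]:
            return False  # nothing to move from the source peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n


def validate_trace(n_disks: int, moves: List[Move]) -> bool:
    """Grade an entire move sequence, not just the final answer."""
    env = TowerOfHanoi(n_disks)
    return all(env.apply(m) for m in moves) and env.solved()


def solve(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Classic recursive solver: a ground-truth trace for any complexity level."""
    if n == 0:
        return []
    return solve(n - 1, src, aux, dst) + [(src, dst)] + solve(n - 1, aux, dst, src)


assert validate_trace(3, solve(3))  # 7 moves at complexity n = 3
```

Because the validator replays the full trace, a setup like this can score a model's reasoning step by step and sweep complexity simply by increasing the disk count, which is the kind of contamination-free, trace-level evaluation the abstract argues for.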