The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Abstract

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations focus primarily on established math and coding benchmarks and emphasize final-answer accuracy. However, this evaluation paradigm often suffers from data contamination and provides no insight into the reasoning traces. In this work, we systematically investigate these gaps using controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables analysis of not only the final answers but also the internal reasoning traces, offering insight into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines even though token budget remains. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, which sheds light on their strengths and limitations and raises questions about their reasoning capabilities.
