Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Abstract

Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect limitations of the experimental design rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) the Tower of Hanoi experiments systematically exceed model output token limits at the reported failure points, and the models explicitly acknowledge these constraints in their outputs; (2) the authors' automated evaluation framework fails to distinguish reasoning failures from practical constraints, which leads to misclassification of model capabilities; (3) most concerningly, their River Crossing benchmarks include instances for N > 5 that are mathematically impossible because the boat capacity is insufficient, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
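To make the contrast between an exhaustive move list and a generating function concrete, the following is a minimal sketch in Python. It is not the code used in the experiments. The language, the function names `hanoi_moves` and `is_valid_solution`, and the peg labels are all assumptions chosen for illustration. The sketch relies only on the standard fact that an optimal solution for N disks has 2^N - 1 moves, so a written-out move list grows exponentially while the program that generates it stays a few lines long.

```python
from typing import Iterable, Iterator, Tuple

Move = Tuple[int, str, str]  # (disk, from_peg, to_peg); hypothetical encoding


def hanoi_moves(n: int, source: str = "A", target: str = "C",
                spare: str = "B") -> Iterator[Move]:
    """Lazily yield the optimal move sequence for an n-disk Tower of Hanoi."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk directly to the target peg.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


def is_valid_solution(n: int, moves: Iterable[Move]) -> bool:
    """Simulate the moves and check that every step is legal and the puzzle ends solved."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        # The moved disk must be on top of its source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return False
        # A disk may never be placed on a smaller disk.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (5, 10, 15):
        count = sum(1 for _ in hanoi_moves(n))
        print(f"N={n}: {count} moves, valid={is_valid_solution(n, hanoi_moves(n))}")
```

Under these assumptions, a model's answer can be checked by running the returned function through a simulator such as `is_valid_solution`, so the check does not depend on the model writing out every move within its output budget.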
