The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Abstract

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations focus primarily on established math and coding benchmarks and emphasize final-answer accuracy. However, this evaluation paradigm often suffers from data contamination and provides no insight into the reasoning traces. In this work, we systematically investigate these gaps using controllable puzzle environments that allow precise manipulation of complexity while maintaining a consistent logical structure. This setup enables analysis of not only the final answers but also the internal reasoning traces, offering insight into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines even though token budget remains. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both kinds of model face complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, which sheds light on their strengths and limitations and raises questions about their reasoning capabilities.
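As a rough illustration of what a "controllable puzzle environment" could look like, the sketch below uses Tower of Hanoi, where a single parameter (the number of disks `n`) sets the complexity while the rules stay fixed. The abstract does not name a specific puzzle, so the choice of puzzle and the helper names `solve_hanoi` and `check_moves` are assumptions made for illustration only, not the authors' implementation. The checker replays a proposed move list step by step, so it can report where a reasoning trace first goes wrong as well as whether the final state is correct.

```python
from typing import List, Optional, Tuple

Move = Tuple[int, int]  # (source peg, target peg), pegs numbered 0..2


def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Reference solver (hypothetical helper): classic recursion, 2**n - 1 moves."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)   # move n-1 disks out of the way
        + [(src, dst)]                      # move the largest disk
        + solve_hanoi(n - 1, aux, src, dst) # move n-1 disks on top of it
    )


def check_moves(n: int, moves: List[Move]) -> Tuple[bool, Optional[int]]:
    """Replay a move list on an n-disk puzzle.

    Returns (solved, first_illegal_index). first_illegal_index is None when
    every move is legal; solved is True only if all disks end on peg 2.
    """
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, top is last
    for i, (src, dst) in enumerate(moves):
        legal = bool(
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and pegs[src]                                        # something to move
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1]) # no larger on smaller
        )
        if not legal:
            return False, i
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1)), None


if __name__ == "__main__":
    # Complexity is controlled by n alone; the shortest solution doubles in
    # length (plus one) with each added disk.
    for n in range(1, 6):
        moves = solve_hanoi(n)
        print(n, len(moves), check_moves(n, moves))
```

Under these assumptions, a model's proposed move list can be passed to `check_moves` for increasing `n` to trace accuracy against complexity, and the index of the first illegal move gives a simple handle on where within a solution an error occurs.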
