Reasoning Models Can Be Effective Without Thinking

Abstract

Recent LLMs have significantly improved their reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets (covering mathematical problem solving, formal theorem proving, and coding), especially in low-budget settings, e.g., 51.3 vs. 28.9 on AMC 23 with 700 tokens. Notably, NoThinking becomes more competitive on pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach, which uses NoThinking to generate N outputs independently and then aggregates them, is highly effective. For aggregation, we use task-specific verifiers when available, and otherwise apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of Thinking baselines with similar latency, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.
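
The two quantities the abstract relies on, pass@k and best-of-N with confidence-based selection, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: `pass_at_k` uses the standard unbiased estimator common in code-generation evaluation, which the abstract does not confirm, and `select_by_confidence` takes hypothetical per-sample confidence scores whose computation the abstract does not specify.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled outputs, c of which are correct.

    Standard unbiased estimator (an assumption here, not taken from the
    abstract): the probability that at least one of k outputs drawn without
    replacement from the n samples is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct output.
    return 1.0 - comb(n - c, k) / comb(n, k)


def select_by_confidence(answers: Sequence[str],
                         confidences: Sequence[float]) -> str:
    """Best-of-N selection: return the answer with the highest confidence.

    `confidences` is a hypothetical per-sample score; how it is obtained
    is left open in this sketch.
    """
    if not answers or len(answers) != len(confidences):
        raise ValueError("need one confidence score per answer")
    best = max(range(len(answers)), key=lambda i: confidences[i])
    return answers[best]


if __name__ == "__main__":
    # 10 independent samples, 3 correct: pass@1 is low, pass@5 much higher.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
    print(select_by_confidence(["12", "15", "12"], [0.4, 0.9, 0.7]))
```

Because the N outputs are generated independently, they can be produced in parallel, so the wall-clock latency of such a scheme is governed by the longest single output rather than by the total number of tokens generated.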
