Thinker: Learning to Think Fast and Slow

Abstract

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. Given a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, producing long, redundant responses and exposing deficiencies in intuition and verification. Inspired by Dual Process Theory in psychology, we introduce a simple modification to the QA task that comprises four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems that benefit from targeted training.
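The four-stage task described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Generate` callable, the prompt wording, the budget values, the word-based token approximation, and the choice to run all four stages unconditionally are all hypothetical, and the RL training that the abstract refers to is not shown.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: (prompt, max_new_tokens) -> generated text.
Generate = Callable[[str, int], str]


@dataclass
class ThinkerTrace:
    fast: str
    verification: str
    slow: str
    summary: str


def truncate_to_budget(text: str, budget: int) -> str:
    """Enforce a budget, approximating tokens by whitespace-separated words."""
    return " ".join(text.split()[:budget])


def run_four_stages(
    question: str,
    generate: Generate,
    fast_budget: int = 1000,  # assumed value; the abstract only says "strict"
    long_budget: int = 4000,  # assumed value for the deliberate stages
) -> ThinkerTrace:
    # Stage 1, Fast Thinking: answer within a strict token budget.
    fast = truncate_to_budget(
        generate(f"Question: {question}\nAnswer concisely.", fast_budget),
        fast_budget,
    )
    # Stage 2, Verification: the model evaluates its initial response.
    verification = truncate_to_budget(
        generate(
            f"Question: {question}\nProposed answer: {fast}\n"
            "Check whether the proposed answer is correct.",
            long_budget,
        ),
        long_budget,
    )
    # Stage 3, Slow Thinking: refine the initial response with more deliberation.
    slow = truncate_to_budget(
        generate(
            f"Question: {question}\nProposed answer: {fast}\n"
            f"Verification: {verification}\nReconsider carefully and refine.",
            long_budget,
        ),
        long_budget,
    )
    # Stage 4, Summarization: distill the refinement into precise steps.
    summary = truncate_to_budget(
        generate(f"Summarize this reasoning as precise steps:\n{slow}", fast_budget),
        fast_budget,
    )
    return ThinkerTrace(fast, verification, slow, summary)


def stub_generate(prompt: str, max_new_tokens: int) -> str:
    """Stand-in model that deliberately overruns its budget by two words."""
    return "token " * (max_new_tokens + 2)
```

Calling `run_four_stages("What is 2 + 2?", stub_generate, fast_budget=3, long_budget=5)` returns a trace whose fields never exceed their budgets, even though the stub overruns them. In practice `generate` would wrap a real model call with an actual tokenizer.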
