Thinker: Learning to Think Fast and Slow
May 27, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI
Abstract
Recent studies show that the reasoning capabilities of Large Language Models
(LLMs) can be improved by applying Reinforcement Learning (RL) to
question-answering (QA) tasks in areas such as math and coding. With a long
context length, LLMs may learn to perform search, as indicated by the
self-correction behavior observed in DeepSeek R1. However, this search behavior
is often imprecise and lacks confidence, resulting in long, redundant responses
and highlighting deficiencies in intuition and verification. Inspired by the
Dual Process Theory in psychology, we introduce a simple modification to the QA
task that includes four stages: Fast Thinking, where the LLM must answer within
a strict token budget; Verification, where the model evaluates its initial
response; Slow Thinking, where it refines the initial response with more
deliberation; and Summarization, where it distills the refinement from the
previous stage into precise steps. Our proposed task improves average accuracy
from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for
DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone
achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial
inference efficiency gains. These findings suggest that intuition and
deliberative reasoning are distinct, complementary systems benefiting from
targeted training.
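The four-stage structure described above can be illustrated with a short sketch. The snippet below is an illustrative outline only, assuming a hypothetical `generate(prompt, max_tokens)` wrapper around the policy model; the concrete prompts, the token budgets for stages other than Fast Thinking (the abstract only mentions the sub-1000-token Fast Thinking budget), and the RL training loop are not specified in the abstract and are assumptions here.

```python
# Minimal sketch of the four-stage QA rollout (Fast Thinking, Verification,
# Slow Thinking, Summarization) described in the abstract.
# `generate` is a hypothetical stand-in for an LLM sampling call (e.g., a
# Qwen2.5-1.5B policy); prompts and non-Fast-Thinking budgets are illustrative.

FAST_BUDGET = 1000  # strict token budget for Fast Thinking (abstract: < 1000 tokens)


def generate(prompt: str, max_tokens: int) -> str:
    """Placeholder for an LLM call; replace with a real model wrapper."""
    return f"<model output for: {prompt[:40]}... (max {max_tokens} tokens)>"


def thinker_rollout(question: str) -> dict:
    """Run the four stages sequentially and return all intermediate outputs."""
    # 1. Fast Thinking: answer intuitively under a strict token budget.
    fast = generate(f"Answer concisely:\n{question}", max_tokens=FAST_BUDGET)

    # 2. Verification: the model evaluates its own initial response.
    verdict = generate(
        f"Question:\n{question}\nInitial answer:\n{fast}\n"
        "Is the initial answer correct? Explain briefly.",
        max_tokens=512,
    )

    # 3. Slow Thinking: refine the initial response with more deliberation.
    slow = generate(
        f"Question:\n{question}\nInitial answer:\n{fast}\nVerification:\n{verdict}\n"
        "Re-derive the answer carefully, correcting any mistakes.",
        max_tokens=4096,
    )

    # 4. Summarization: distill the refinement into precise steps.
    summary = generate(
        f"Summarize the solution below as precise steps:\n{slow}",
        max_tokens=512,
    )

    return {"fast": fast, "verification": verdict, "slow": slow, "summary": summary}


if __name__ == "__main__":
    print(thinker_rollout("What is 12 * 13?"))
```

In this framing, the Fast Thinking output can also be used on its own at inference time, which is how the abstract's 26.8% accuracy under 1000 tokens for Qwen2.5-1.5B should be read.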