Thinker: Learning to Think Fast and Slow

Abstract

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.
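The four stages described above can be sketched as a simple control flow. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable, the stage tags in the prompts, the `long_budget` value, and the choice to skip Slow Thinking and Summarization when verification passes are all hypothetical. Only the idea of a strict token budget for Fast Thinking (here 1000, echoing the figure quoted above) comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface: (prompt, max_tokens) -> generated text.
Generate = Callable[[str, int], str]


@dataclass
class ThinkerTrace:
    fast_answer: str
    verified: bool
    slow_answer: Optional[str] = None
    summary: Optional[str] = None

    @property
    def final_answer(self) -> str:
        # Use the refined answer if Slow Thinking ran, else the fast one.
        return self.fast_answer if self.slow_answer is None else self.slow_answer


def run_thinker(question: str, generate: Generate,
                fast_budget: int = 1000, long_budget: int = 6000) -> ThinkerTrace:
    # Stage 1, Fast Thinking: answer within a strict token budget.
    fast = generate(f"[FAST] Answer concisely.\nQuestion: {question}", fast_budget)

    # Stage 2, Verification: the model evaluates its own initial response.
    verdict = generate(
        f"[VERIFY] Is this answer correct? Reply yes or no.\n"
        f"Question: {question}\nAnswer: {fast}",
        long_budget,
    )
    trace = ThinkerTrace(fast_answer=fast,
                         verified=verdict.strip().lower().startswith("yes"))
    if trace.verified:
        # Assumption: a verified fast answer needs no further refinement.
        return trace

    # Stage 3, Slow Thinking: refine the initial response with more deliberation.
    trace.slow_answer = generate(
        f"[SLOW] Reconsider carefully.\nQuestion: {question}\nEarlier answer: {fast}",
        long_budget,
    )

    # Stage 4, Summarization: distill the refinement into precise steps.
    trace.summary = generate(
        f"[SUMMARIZE] Condense into precise steps.\n"
        f"Question: {question}\nReasoning: {trace.slow_answer}",
        fast_budget,
    )
    return trace


def toy_generate(prompt: str, max_tokens: int) -> str:
    """Stand-in for an LLM: one canned reply per stage, cut to the word budget."""
    if prompt.startswith("[FAST]"):
        reply = "5"
    elif prompt.startswith("[VERIFY]"):
        reply = "no"
    elif prompt.startswith("[SLOW]"):
        reply = "4"
    else:
        reply = "2 + 2 = 4"
    return " ".join(reply.split()[:max_tokens])


trace = run_thinker("What is 2 + 2?", toy_generate)
print(trace.fast_answer, trace.verified, trace.final_answer)
```

The toy model deliberately gives a wrong fast answer that fails verification, so all four stages run. In an RL setting each stage's output would presumably be scored separately, but the reward design is not specified in the abstract and is left out here.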

