Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Abstract

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experiments with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than the correctness of answers, proves to be the critical factor: models primed with incorrect solutions that contain proper reasoning patterns achieve performance comparable to models trained on correct solutions. Finally, continued pretraining on OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
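To make two of the behaviors named above concrete, the following is a minimal sketch of a Countdown-style search that uses verification (checking a candidate value against the target) and backtracking (abandoning a failed branch and trying another). It is an illustration only, not the method or code used in this work. It assumes a common variant of the game in which each given number may be used at most once and combined with +, -, *, and /; the function names `solve` and `_search` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def solve(numbers, target):
    """Return an expression string that evaluates to target, or None.

    Assumed rules (not taken from the paper): each number is used at most
    once, and the allowed operations are +, -, * and /.
    """
    # Exact rational arithmetic avoids floating-point errors in division.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Verification: does any value built so far equal the target?
    for value, expr in items:
        if value == target:
            return expr
    if len(items) == 1:
        return None  # dead end: the caller backtracks to another branch

    # Pick two values, combine them, and recurse on the smaller problem.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
            # Backtracking: this combination failed, so try the next one.
    return None


if __name__ == "__main__":
    print(solve([3, 5, 7], 22))
```

For the numbers 3, 5, 7 and target 22, the search first tries 3 + 5, finds no way to reach 22 from the result, backtracks, and then succeeds via 3 * 5 combined with 7. A language model exhibiting these behaviors would do the analogous thing in text: propose a step, check it, and revise when the check fails.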
