Teaching Language Models to Think in Code
May 11, 2026
Authors: Hyeon Hwang, Jiwoo Lee, Jaewoo Kang
cs.AI
Abstract
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
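To make the trajectory format concrete, the sketch below shows what a code-centric reasoning loop of this kind might look like. It is a minimal, hypothetical illustration, not the paper's released implementation: the `model.generate` interface, the prompt wording, the `[interpreter output]` delimiter, and the fenced-code extraction are all assumptions introduced here.

```python
# Hypothetical sketch of a ThinC-style loop: a brief NL plan, then code
# blocks bridged only by their execution outputs. NOT the authors' code.
import contextlib
import io
import re


def extract_last_code_block(text: str):
    """Pull the most recent ```python ...``` block from a generation."""
    blocks = re.findall(r"```python\n(.*?)```", text, re.DOTALL)
    return blocks[-1] if blocks else None


def run_python(code: str) -> str:
    """Execute one code block, capturing stdout; on failure, return the
    error message so the next code block can attempt recovery."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def solve(model, problem: str, max_steps: int = 8) -> str:
    # Step 1: a brief NL planning step opens the trajectory.
    prompt = f"Plan briefly, then reason only in code:\n{problem}\n"
    latest = model.generate(prompt)  # assumed interface: str -> str
    trajectory = prompt + latest
    # Step 2: each subsequent code block sees only the execution outputs.
    for _ in range(max_steps):
        code = extract_last_code_block(latest)
        if code is None:  # no new code block: the model gave a final answer
            break
        trajectory += f"\n[interpreter output]\n{run_python(code)}\n"
        latest = model.generate(trajectory)  # continue from the output
        trajectory += latest
    return trajectory
```

Errors are deliberately fed back as interpreter output rather than hidden, matching the abstract's claim that the model recovers from execution failures without intermediate NL reasoning.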