Teaching Language Models to Think in Code
May 11, 2026
Authors: Hyeon Hwang, Jiwoo Lee, Jaewoo Kang
cs.AI
Abstract
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
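The trajectory structure described in the abstract, a single brief NL planning step followed by code blocks connected only by their execution outputs, can be pictured as a simple rollout loop. The sketch below is an illustration only, not the authors' released implementation: the `model.generate` interface, the ```output``` fence used to feed interpreter results back, and the boxed-answer extraction are all assumptions about how such a loop might be wired up.

```python
import re
import subprocess
import sys

def run_code(block: str, timeout: int = 10) -> str:
    """Execute one Python code block in a subprocess and return its stdout (or stderr on failure)."""
    proc = subprocess.run(
        [sys.executable, "-c", block],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

def code_centric_rollout(model, problem: str, max_blocks: int = 8) -> str:
    """Hypothetical code-centric rollout: the model emits a brief NL plan, then a sequence of
    code blocks; each block sees only the previous blocks and their execution outputs."""
    prompt = f"Problem: {problem}\nWrite a brief plan, then solve the problem entirely in code.\n"
    for _ in range(max_blocks):
        # `model.generate` and the stop sequence are assumptions for this sketch.
        completion = model.generate(prompt, stop=["```output"])
        prompt += completion
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if match is None:
            # No further code block: the model has produced its final answer.
            break
        result = run_code(match.group(1))
        # Feed the interpreter output back verbatim; the next code block reasons from this
        # output rather than from intermediate natural-language computation.
        prompt += f"\n```output\n{result}\n```\n"
    # Assumed convention: the final answer is reported as \boxed{...} in the last turn.
    answer = re.search(r"\\boxed\{(.*?)\}", prompt)
    return answer.group(1) if answer else prompt
```

Under this framing, recovery from a failed execution happens inside the loop: the error text is returned as the block's output, and the next code block can correct course without any intermediate NL reasoning, matching the behavior reported in the abstract.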