Teaching Language Models to Think in Code
May 11, 2026
Authors: Hyeon Hwang, Jiwoo Lee, Jaewoo Kang
cs.AI
Abstract
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
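The trajectory structure described in the abstract, a single brief NL planning step followed by code blocks connected only by their execution outputs, can be pictured as a simple rollout loop. The sketch below is an illustration only, not the authors' released implementation: the `model.generate` interface, the ```output``` fence used to feed interpreter results back, and the boxed-answer extraction are all assumptions about how such a loop might be wired up.

```python
import re
import subprocess
import sys

def run_code(block: str, timeout: int = 10) -> str:
    """Execute one Python code block in a subprocess and return its stdout (or stderr on failure)."""
    proc = subprocess.run(
        [sys.executable, "-c", block],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

def code_centric_rollout(model, problem: str, max_blocks: int = 8) -> str:
    """Hypothetical code-centric rollout: the model emits a brief NL plan, then a sequence of
    code blocks; each block sees only the previous blocks and their execution outputs."""
    prompt = f"Problem: {problem}\nWrite a brief plan, then solve the problem entirely in code.\n"
    for _ in range(max_blocks):
        # `model.generate` and the stop sequence are assumptions for this sketch.
        completion = model.generate(prompt, stop=["```output"])
        prompt += completion
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if match is None:
            # No further code block: the model has produced its final answer.
            break
        result = run_code(match.group(1))
        # Feed the interpreter output back verbatim; the next code block reasons from this
        # output rather than from intermediate natural-language computation.
        prompt += f"\n```output\n{result}\n```\n"
    # Assumed convention: the final answer is reported as \boxed{...} in the last turn.
    answer = re.search(r"\\boxed\{(.*?)\}", prompt)
    return answer.group(1) if answer else prompt
```

Under this framing, recovery from a failed execution happens inside the loop: the error text is returned as the block's output, and the next code block can correct course without any intermediate NL reasoning, matching the behavior reported in the abstract.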