Teaching Language Models to Think in Code

Abstract

Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
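
The trajectory structure described above (a brief NL plan, then code blocks linked only by their execution outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_block` and `run_trajectory` are hypothetical, the blocks are hand-written rather than model-generated, and the code runs in-process with `exec`, where a real system would presumably use a sandboxed interpreter.

```python
import contextlib
import io
import traceback


def run_block(code: str, namespace: dict) -> str:
    """Execute one code block and return what it printed (or the error line)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Assumption: a failed block reports only the final traceback line,
        # which becomes the "execution output" seen by the next block.
        last_line = traceback.format_exc().strip().splitlines()[-1]
        buffer.write("\n" + last_line)
    return buffer.getvalue().strip()


def run_trajectory(plan: str, blocks: list[str]) -> dict:
    """Run blocks in order in one shared namespace.

    No natural-language text is placed between blocks; each step records
    only the code and its execution output.
    """
    namespace: dict = {}
    steps = []
    for code in blocks:
        steps.append({"code": code, "output": run_block(code, namespace)})
    answer = steps[-1]["output"] if steps else ""
    return {"plan": plan, "steps": steps, "answer": answer}


# Hypothetical toy problem: sum the positive integers below 100
# that are divisible by 3 or 5.
result = run_trajectory(
    plan="Collect the qualifying integers, then sum them.",
    blocks=[
        "nums = [n for n in range(1, 100) if n % 3 == 0 or n % 5 == 0]\n"
        "print(len(nums))",
        "print(sum(nums))",
    ],
)
print(result["answer"])  # prints 2318
```

In this sketch the final answer is read directly from the last interpreter output, which loosely mirrors the abstract's notion of an answer grounded in interpreter output.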
