Replacing thinking with tool usage enables reasoning in small language models
July 7, 2025
Authors: Corrado Rainone, Tim Bakker, Roland Memisevic
cs.AI
Abstract
Recent advances have established a new machine learning paradigm based on
scaling up compute at inference time as well as at training time. In that line
of work, a combination of Supervised Fine-Tuning (SFT) on synthetic
demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is
used for training Large Language Models to expend extra compute during
inference in the form of "thoughts" expressed in natural language. In this
paper, we propose to instead format these tokens as a multi-turn interaction
trace with a stateful tool. At each turn, the new state of the tool is appended
to the context of the model, whose job is to generate the tokens necessary to
control the tool via a custom DSL. We benchmark this approach on the problem of
repairing malfunctioning Python code, and show that this constrained setup
allows for faster sampling of experience and a denser reward signal,
enabling even models of size up to 3B parameters to learn how to
proficiently expend additional compute on the task.
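
To make the multi-turn setup concrete, below is a minimal sketch of the interaction loop the abstract describes: at each turn the model emits DSL tokens, the stateful tool executes them, and the tool's new state is appended to the model's context. All names here (StatefulTool, repair_episode, the SUBMIT stop command) are hypothetical illustrations, not the paper's actual interface.

```python
class StatefulTool:
    """Hypothetical stateful tool holding the buggy Python program."""

    def __init__(self, source_code: str):
        self.source = source_code

    def apply(self, dsl_command: str) -> str:
        """Interpret one DSL command (e.g., an edit) and return the new state."""
        # ... parse the command and mutate self.source accordingly ...
        return self.source


def repair_episode(model, tool: StatefulTool, prompt: str, max_turns: int = 8) -> str:
    """Roll out one multi-turn repair trace.

    At each turn, the new tool state is appended to the model's context,
    and the model generates the tokens that control the tool via the DSL.
    """
    context = prompt
    for _ in range(max_turns):
        command = model.generate(context)   # tokens in the custom DSL
        new_state = tool.apply(command)     # tool executes the command
        context += f"\n<command>{command}</command>\n<state>{new_state}</state>"
        if command.strip() == "SUBMIT":     # hypothetical terminal action
            break
    return tool.source
```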
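The "denser reward signal" can be illustrated with a verifiable reward that scores partial progress rather than a binary pass/fail outcome. The sketch below, assuming a hypothetical test-suite interface, returns the fraction of unit tests the candidate repair passes; the paper's exact reward definition is not given in the abstract.

```python
def dense_reward(repaired_source: str, tests: list) -> float:
    """Score a candidate repair by the fraction of unit tests it passes.

    A graded score in [0, 1] gives RLVR a denser learning signal than
    a binary reward that only fires when every test passes.
    """
    namespace = {}
    try:
        exec(repaired_source, namespace)  # load the candidate program
    except Exception:
        return 0.0  # unrunnable code receives the minimum reward

    passed = 0
    for test in tests:
        try:
            test(namespace)  # each test raises AssertionError on failure
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0
```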