Replacing thinking with tool usage enables reasoning in small language models
July 7, 2025
Authors: Corrado Rainone, Tim Bakker, Roland Memisevic
cs.AI
Abstract
Recent advances have established a new machine learning paradigm based on
scaling up compute at inference time as well as at training time. In that line
of work, a combination of Supervised Fine-Tuning (SFT) on synthetic
demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is
used for training Large Language Models to expend extra compute during
inference in the form of "thoughts" expressed in natural language. In this
paper, we propose to instead format these tokens as a multi-turn interaction
trace with a stateful tool. At each turn, the new state of the tool is appended
to the context of the model, whose job is to generate the tokens necessary to
control the tool via a custom domain-specific language (DSL). We benchmark this
approach on the problem of repairing malfunctioning Python code, and show that
this constrained setup enables faster sampling of experience and a denser reward
signal, allowing even models of up to 3B parameters to learn how to proficiently
expend additional compute on the task.
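
To make the interaction format concrete, below is a minimal sketch of the multi-turn loop the abstract describes: the model emits a DSL command, the stateful tool applies it, and the tool's new state is appended to the context until the model submits, at which point a verifiable reward is computed from unit tests. All names here (EditTool, the REPLACE/SUBMIT commands, the scripted policy, the reward definition) are hypothetical illustrations, not the paper's actual DSL or API.

```python
# Hypothetical sketch of a multi-turn rollout with a stateful code-editing tool.
from typing import Callable, List, Tuple


class EditTool:
    """Toy stateful tool: holds the program being repaired as a list of lines
    and applies simple DSL commands that edit it."""

    def __init__(self, broken_code: str):
        self.lines = broken_code.splitlines()

    def apply(self, command: str) -> str:
        # Example DSL: "REPLACE <line_no> <new source>" rewrites one line.
        if command.startswith("REPLACE "):
            _, line_no, new_src = command.split(" ", 2)
            self.lines[int(line_no)] = new_src
        # Return the new state so the caller can append it to the context.
        return self.code()

    def code(self) -> str:
        return "\n".join(self.lines)


def rollout(policy: Callable[[str], str], tool: EditTool,
            tests: Callable[[str], float],
            max_turns: int = 8) -> Tuple[List[str], float]:
    """One episode: the policy reads the growing context, emits a DSL command,
    and the tool's new state is appended back into the context."""
    context = ["STATE:\n" + tool.code()]
    for _ in range(max_turns):
        command = policy("\n".join(context))
        context.append("COMMAND: " + command)
        if command.strip() == "SUBMIT":
            break
        context.append("STATE:\n" + tool.apply(command))
    # Verifiable reward: fraction of unit tests the repaired program passes.
    reward = tests(tool.code())
    return context, reward


if __name__ == "__main__":
    broken = "def add(a, b):\n    return a - b"  # bug: '-' instead of '+'

    def scripted_policy(ctx: str) -> str:
        # Stand-in for the language model: fix the bug once, then submit.
        latest_state = ctx.rsplit("STATE:\n", 1)[-1]
        return "REPLACE 1     return a + b" if "a - b" in latest_state else "SUBMIT"

    def unit_tests(src: str) -> float:
        ns: dict = {}
        exec(src, ns)
        cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
        return sum(ns["add"](*args) == out for args, out in cases) / len(cases)

    trace, reward = rollout(scripted_policy, EditTool(broken), unit_tests)
    print("reward:", reward)  # 1.0 once the fix is applied
```

In a training setup along the lines the abstract suggests, the scripted policy would be replaced by the language model being fine-tuned, and the per-episode reward would feed an RLVR-style update; because the tool is cheap to run and the reward is checkable, rollouts are fast and the signal is comparatively dense.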