Replacing thinking with tool usage enables reasoning in small language models
July 7, 2025
Authors: Corrado Rainone, Tim Bakker, Roland Memisevic
cs.AI
Abstract
Recent advances have established a new machine learning paradigm based on
scaling up compute at inference time as well as at training time. In that line
of work, a combination of Supervised Fine-Tuning (SFT) on synthetic
demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is
used for training Large Language Models to expend extra compute during
inference in the form of "thoughts" expressed in natural language. In this
paper, we propose to instead format these tokens as a multi-turn interaction
trace with a stateful tool. At each turn, the new state of the tool is appended
to the context of the model, whose job is to generate the tokens necessary to
control the tool via a custom domain-specific language (DSL). We benchmark this
approach on the problem of repairing malfunctioning Python code, and show that
this constrained setup enables faster sampling of experience and a denser reward
signal, allowing even models of up to 3B parameters to learn how to proficiently
expend additional compute on the task.
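
To make the interaction format concrete, below is a minimal sketch of the multi-turn loop the abstract describes: the model emits a DSL command, the stateful tool applies it, and the tool's new state is appended to the context until the model submits, at which point a verifiable reward is computed from unit tests. All names here (EditTool, the REPLACE/SUBMIT commands, the scripted policy, the reward definition) are hypothetical illustrations, not the paper's actual DSL or API.

```python
# Hypothetical sketch of a multi-turn rollout with a stateful code-editing tool.
from typing import Callable, List, Tuple


class EditTool:
    """Toy stateful tool: holds the program being repaired as a list of lines
    and applies simple DSL commands that edit it."""

    def __init__(self, broken_code: str):
        self.lines = broken_code.splitlines()

    def apply(self, command: str) -> str:
        # Example DSL: "REPLACE <line_no> <new source>" rewrites one line.
        if command.startswith("REPLACE "):
            _, line_no, new_src = command.split(" ", 2)
            self.lines[int(line_no)] = new_src
        # Return the new state so the caller can append it to the context.
        return self.code()

    def code(self) -> str:
        return "\n".join(self.lines)


def rollout(policy: Callable[[str], str], tool: EditTool,
            tests: Callable[[str], float],
            max_turns: int = 8) -> Tuple[List[str], float]:
    """One episode: the policy reads the growing context, emits a DSL command,
    and the tool's new state is appended back into the context."""
    context = ["STATE:\n" + tool.code()]
    for _ in range(max_turns):
        command = policy("\n".join(context))
        context.append("COMMAND: " + command)
        if command.strip() == "SUBMIT":
            break
        context.append("STATE:\n" + tool.apply(command))
    # Verifiable reward: fraction of unit tests the repaired program passes.
    reward = tests(tool.code())
    return context, reward


if __name__ == "__main__":
    broken = "def add(a, b):\n    return a - b"  # bug: '-' instead of '+'

    def scripted_policy(ctx: str) -> str:
        # Stand-in for the language model: fix the bug once, then submit.
        latest_state = ctx.rsplit("STATE:\n", 1)[-1]
        return "REPLACE 1     return a + b" if "a - b" in latest_state else "SUBMIT"

    def unit_tests(src: str) -> float:
        ns: dict = {}
        exec(src, ns)
        cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
        return sum(ns["add"](*args) == out for args, out in cases) / len(cases)

    trace, reward = rollout(scripted_policy, EditTool(broken), unit_tests)
    print("reward:", reward)  # 1.0 once the fix is applied
```

In a training setup along the lines the abstract suggests, the scripted policy would be replaced by the language model being fine-tuned, and the per-episode reward would feed an RLVR-style update; because the tool is cheap to run and the reward is checkable, rollouts are fast and the signal is comparatively dense.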