Replacing thinking with tool usage enables reasoning in small language models

Abstract

Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used to train Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens necessary to control the tool via a custom DSL. We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows for faster sampling of experience and a denser reward signal, enabling even models of up to 3B parameters to learn how to proficiently expend additional compute on the task.
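The multi-turn loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `EditorTool` class, the two-command DSL (`REPLACE <n> <text>` and `DELETE <n>`), the scripted policy that stands in for the language model, and the per-turn pass/fail reward. The sketch only shows the general shape of the idea: the tool holds state, each command mutates it, and the new state is appended to the context before the next turn.

```python
from dataclasses import dataclass


@dataclass
class EditorTool:
    """Toy stateful tool: holds a code buffer and applies DSL commands to it."""
    lines: list

    def render(self) -> str:
        # State shown to the model: a numbered listing of the buffer.
        return "\n".join(f"{i}: {line}" for i, line in enumerate(self.lines, start=1))

    def apply(self, command: str) -> None:
        # Hypothetical DSL (an assumption): "REPLACE <n> <text>" or "DELETE <n>".
        op, _, rest = command.partition(" ")
        if op == "REPLACE":
            n, _, text = rest.partition(" ")
            self.lines[int(n) - 1] = text
        elif op == "DELETE":
            del self.lines[int(rest) - 1]
        else:
            raise ValueError(f"unknown command: {command!r}")

    def passes(self) -> bool:
        # Verifiable check: execute the buffer and test the repaired function.
        namespace = {}
        try:
            exec("\n".join(self.lines), namespace)
            return namespace["add"](2, 3) == 5
        except Exception:
            return False


def scripted_policy(context: str) -> str:
    # Stand-in for the language model: always proposes the same one-line fix.
    return "REPLACE 2     return a + b"


def run_episode(tool, policy, max_turns=4):
    context = tool.render()
    rewards = []
    for _ in range(max_turns):
        command = policy(context)
        tool.apply(command)
        # Assumed reward scheme: one pass/fail signal after every turn.
        reward = 1.0 if tool.passes() else 0.0
        rewards.append(reward)
        # Append the command and the tool's new state to the model context.
        context += f"\n> {command}\n{tool.render()}"
        if reward == 1.0:
            break
    return context, rewards


# Buggy program: subtraction where addition was intended.
tool = EditorTool(lines=["def add(a, b):", "    return a - b"])
context, rewards = run_episode(tool, scripted_policy)
```

In this toy setup the model only has to emit short commands, and every turn can be checked mechanically. That is one plausible reading of why a constrained tool interface could give cheaper sampling and more frequent feedback than free-form natural-language thoughts.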
