Replacing thinking with tool usage enables reasoning in small language models

Abstract

Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used to train Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens needed to control the tool via a custom Domain-Specific Language (DSL). We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows faster sampling of experience and provides a denser reward signal, so that even models of up to 3B parameters learn to proficiently expend additional compute on the task.
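The interaction loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `EditorTool` class, the two-command DSL (`REPLACE`, `DELETE`), the `passes` check, and the hard-coded `policy` that stands in for the language model. The abstract does not specify the actual tool, DSL, or reward, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EditorTool:
    """Hypothetical stateful tool: holds the current source as a list of lines."""
    lines: List[str]

    def apply(self, command: str) -> None:
        # Assumed toy DSL: "REPLACE <line_no> <new text>" or "DELETE <line_no>".
        op, _, rest = command.partition(" ")
        if op == "REPLACE":
            line_no, _, text = rest.partition(" ")
            self.lines[int(line_no) - 1] = text
        elif op == "DELETE":
            del self.lines[int(rest) - 1]
        else:
            raise ValueError(f"unknown command: {command!r}")

    def render(self) -> str:
        # The tool state as it would be shown to the model (numbered lines).
        return "\n".join(f"{i}: {line}" for i, line in enumerate(self.lines, 1))

    def source(self) -> str:
        return "\n".join(self.lines)


def passes(source: str, check: str) -> bool:
    """Verifiable check: the code must run and satisfy the assertion in `check`."""
    env: dict = {}
    try:
        exec(source, env)
        exec(check, env)
        return True
    except Exception:
        return False


def run_episode(
    tool: EditorTool,
    policy: Callable[[List[str]], str],
    check: str,
    max_turns: int = 4,
) -> Tuple[List[str], bool]:
    """Alternate model commands and tool states until the check passes."""
    context = [tool.render()]
    for _ in range(max_turns):
        if passes(tool.source(), check):
            break
        command = policy(context)      # the model emits one DSL command
        tool.apply(command)            # the tool updates its state
        context.append(command)
        context.append(tool.render())  # the new state is appended to the context
    return context, passes(tool.source(), check)


# A stub policy standing in for the model: it always proposes the same fix.
buggy = ["def add(a, b):", "    return a - b"]
policy = lambda ctx: "REPLACE 2 " + "    return a + b"
context, solved = run_episode(EditorTool(buggy), policy, "assert add(2, 3) == 5")
print(solved)
```

In this sketch the model only ever produces short DSL commands, while the tool carries the full program state between turns. The toy check does not guard against non-terminating code; a real setup would need sandboxing and timeouts.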

