Self-Steering Language Models

Abstract

While test-time reasoning enables language models (LMs) to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure: both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for "self-steering" LMs in which a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. In decoupling planning from execution, our work opens up a design space of highly parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.
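To make the contrast between best-of-N sampling and a step-wise Monte Carlo inference program more concrete, here is a minimal toy sketch. It is not DisCIPL's implementation: the vocabulary, the first-letter constraint, and every function name below are hypothetical stand-ins, and the "Follower" is a uniform random word sampler rather than a language model. The only point illustrated is that a verifier applied to partial outputs lets a particle-based search discard failing candidates early, whereas best-of-N checks only complete samples.

```python
import random

# Hypothetical toy vocabulary standing in for a Follower LM's output distribution.
VOCAB = ["apple", "arch", "bird", "bright", "cat", "calm", "dog", "deep", "dune", "echo"]


def follower_step(rng):
    """Stand-in for one Follower sampling step: draw a word uniformly at random."""
    return rng.choice(VOCAB)


def prefix_ok(words, initials):
    """Verifier usable on partial outputs: word i must start with initials[i]."""
    return all(w[0] == c for w, c in zip(words, initials))


def best_of_n(initials, n, rng):
    """Baseline: sample up to n complete sequences and verify only at the end."""
    for _ in range(n):
        words = [follower_step(rng) for _ in initials]
        if prefix_ok(words, initials):
            return words
    return None


def smc_program(initials, n_particles, rng):
    """Toy inference program in the style of sequential Monte Carlo.

    Each particle is extended by one word per step; particles that violate
    the constraint are dropped, and the population is resampled uniformly
    from the survivors (all survivors carry equal 0/1-indicator weight).
    """
    particles = [[] for _ in range(n_particles)]
    for _ in initials:
        extended = [p + [follower_step(rng)] for p in particles]
        survivors = [p for p in extended if prefix_ok(p, initials)]
        if not survivors:
            return None  # every particle failed at this step
        particles = [list(rng.choice(survivors)) for _ in range(n_particles)]
    return particles[0]


def success_count(method, initials, budget, trials):
    """Count how many seeded trials return a valid sequence."""
    return sum(
        method(initials, budget, random.Random(seed)) is not None
        for seed in range(trials)
    )
```

With a hypothetical constraint such as `"abcd"` and an equal budget of 32 samples or particles, calling `success_count(smc_program, "abcd", 32, 200)` and `success_count(best_of_n, "abcd", 32, 200)` shows the step-wise search succeeding far more often in this toy setting. That gap comes from the toy constraint being checkable on prefixes; it says nothing about the results reported in the paper.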
