Soft Tokens, Hard Truths

Abstract

The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has recently attracted attention, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by severe training difficulties: previous works either use continuous tokens only at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs, at a computational cost that limits the CoT to very few tokens. This is the first work to introduce a scalable method for learning continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens, together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
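To make the notion of a "soft" token concrete, the following is a minimal sketch in plain Python of a probability-weighted mixture of token embeddings with Gaussian noise added to the resulting input embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `soft_token`, `hard_token`), the use of softmax probabilities as mixture weights, and the `noise_std` and `temperature` parameters are all hypothetical choices made for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to a probability vector (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embedding_table, noise_std=0.0, temperature=1.0, rng=None):
    """Return a noisy, probability-weighted mixture of token embeddings.

    Assumption for this sketch: mixture weights are the softmax of the
    next-token logits, and noise is isotropic Gaussian on the embedding.
    """
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    # Mixture of tokens: expected embedding under the token distribution.
    mixed = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    # The noise makes the step stochastic, which is what gives an RL
    # algorithm something to explore over.
    return [m + rng.gauss(0.0, noise_std) for m in mixed]


def hard_token(logits, embedding_table):
    """Discrete counterpart: embedding of the single most likely token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding_table[best])
```

With a two-token vocabulary whose embeddings are `[1.0, 0.0]` and `[0.0, 1.0]` and equal logits, `soft_token` without noise returns the midpoint of the two embeddings, whereas `hard_token` must commit to one of them. A positive `noise_std` perturbs the mixture differently on each call, which is the source of exploration in this sketch.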
