Soft Tokens, Hard Truths
September 23, 2025
Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
cs.AI
Abstract
The use of continuous instead of discrete tokens during the Chain-of-Thought
(CoT) phase of reasoning LLMs has garnered attention recently, based on the
intuition that a continuous mixture of discrete tokens could simulate a
superposition of several reasoning paths simultaneously. Theoretical results
have formally proven that continuous tokens have much greater expressivity and
can solve specific problems more efficiently. However, practical use of
continuous tokens has been limited by strong training difficulties: previous
works either just use continuous tokens at inference time on a pre-trained
discrete-token model, or must distill the continuous CoT from ground-truth
discrete CoTs and face computational costs that limit the CoT to very few
tokens.
This is the first work introducing a scalable method to learn continuous CoTs
via reinforcement learning (RL), without distilling from reference discrete
CoTs. We use "soft" tokens: mixtures of tokens, together with noise on the input
embedding, to provide RL exploration. Computational overhead is minimal,
enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning
benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs
matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing
greater CoT diversity. In systematic comparisons, the best-performing scenario
is to train with continuous CoT tokens and then use discrete tokens for inference,
meaning the "soft" models can be deployed in a standard way. Finally, we show
continuous CoT RL training better preserves the predictions of the base model
on out-of-domain tasks, thus providing a softer touch to the base model.
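
To make the "soft token" idea concrete, the following is a minimal, hypothetical PyTorch sketch of how a continuous CoT input could be formed: the next-token distribution is used to mix all token embeddings, and Gaussian noise is added to the mixture to drive RL exploration. The function name soft_token_embedding and the temperature and noise_std parameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def soft_token_embedding(logits, embedding_matrix, temperature=1.0, noise_std=0.1):
    """Form one continuous "soft token" input embedding (illustrative sketch).

    Instead of feeding back the embedding of a single sampled token,
    mix all token embeddings weighted by the next-token distribution,
    then add Gaussian noise to the mixture so that RL can explore.

    logits:            (batch, vocab) next-token logits from the model
    embedding_matrix:  (vocab, hidden) input embedding table
    """
    probs = torch.softmax(logits / temperature, dim=-1)   # token mixture weights
    mixed = probs @ embedding_matrix                       # (batch, hidden) soft-token embedding
    noise = noise_std * torch.randn_like(mixed)            # exploration noise for RL
    return mixed + noise
```

During the continuous CoT phase, such a mixed embedding would be fed back as the next input position instead of a discrete token's embedding; at inference time, discrete tokens can be sampled as usual, which is the deployment scenario the abstract reports as performing best.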