Soft Tokens, Hard Truths

September 23, 2025
Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
cs.AI

Abstract

The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either use continuous tokens only at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs on pass@1 and surpasses them on pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens at inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
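
A minimal sketch of the "soft token" idea described in the abstract, assuming a PyTorch-style model. The names `soft_token_embedding`, `logits`, `embedding_matrix`, `temperature`, and `noise_scale` are illustrative assumptions rather than the paper's code: the continuous CoT token is a probability-weighted mixture of input embeddings, with Gaussian noise added on the embedding to give RL something to explore.

```python
import torch

def soft_token_embedding(logits: torch.Tensor,
                         embedding_matrix: torch.Tensor,
                         temperature: float = 1.0,
                         noise_scale: float = 0.1) -> torch.Tensor:
    """Illustrative sketch, not the paper's implementation.

    logits:           (vocab_size,) next-token logits from the model
    embedding_matrix: (vocab_size, hidden_dim) input embedding table
    Returns a continuous "soft" input embedding for the next CoT step.
    """
    # Mixture weights over the vocabulary instead of a hard argmax/sample.
    probs = torch.softmax(logits / temperature, dim=-1)   # (vocab_size,)

    # Continuous token = probability-weighted mixture of token embeddings.
    mixed = probs @ embedding_matrix                       # (hidden_dim,)

    # Gaussian noise on the input embedding provides RL exploration.
    return mixed + noise_scale * torch.randn_like(mixed)
```

Under this reading, each CoT step would feed the returned vector back as the next input embedding instead of a sampled discrete token's embedding; at answer time (and, per the abstract, at deployment) the model switches back to standard discrete decoding.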