
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

June 12, 2025
Authors: Liran Ringel, Elad Tolochinsky, Yaniv Romano
cs.AI

Abstract

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
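As a rough illustration of the setup described in the abstract, the sketch below (Python, Hugging Face Transformers) registers a new "<|continue-thinking|>" token, freezes all model weights except that token's embedding row, and shows a simple budget-forcing pass that swaps an emitted "</think>" for the learned token. The checkpoint name, optimizer, gradient-masking hook, and two-round generation loop are illustrative assumptions; the paper's actual RL objective for training the embedding is not reproduced here.

```python
# Minimal sketch (not the authors' code) of a learned continue-thinking token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed distilled DeepSeek-R1 checkpoint; the paper uses a distilled R1 model.
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Register the new special token and grow the embedding matrix by one row.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|continue-thinking|>"]})
model.resize_token_embeddings(len(tokenizer))
continue_id = tokenizer.convert_tokens_to_ids("<|continue-thinking|>")
# Assumes "</think>" is a single special token, as in the R1-distill tokenizers.
end_think_id = tokenizer.convert_tokens_to_ids("</think>")

# 2) Freeze all model weights; only the new token's embedding row gets gradients.
for p in model.parameters():
    p.requires_grad = False
embedding = model.get_input_embeddings()
embedding.weight.requires_grad = True  # re-enable, then mask grads to the new row

def keep_only_new_row(grad):
    # Zero out gradients for every embedding row except the new token's.
    mask = torch.zeros_like(grad)
    mask[continue_id] = 1.0
    return grad * mask

embedding.weight.register_hook(keep_only_new_row)

# Single-parameter optimizer; in the paper this embedding is trained with RL
# (a policy-gradient-style objective on math problems), which is omitted here.
optimizer = torch.optim.Adam([embedding.weight], lr=1e-3)

# 3) Budget forcing at test time (sketch): stop at "</think>", replace it with the
#    learned token, and let the model continue reasoning for one extra round.
prompt = "Natalia sold 48 clips in April and half as many in May. How many in total?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512, eos_token_id=end_think_id)
forced = torch.cat([out[:, :-1],  # drop the emitted "</think>" (if present)
                    torch.tensor([[continue_id]])], dim=-1)
out = model.generate(forced, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```

In this sketch the fixed-token baseline from the abstract would simply append the tokenizer's id for "Wait" instead of `continue_id`; the learned-token method differs only in that the appended row of the embedding matrix has been optimized.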