

Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

June 12, 2025
Authors: Liran Ringel, Elad Tolochinsky, Yaniv Romano
cs.AI

Abstract

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
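To make the setup concrete, the sketch below illustrates the two ingredients the abstract describes, assuming a Hugging Face-style causal LM. The checkpoint name, the gradient-masking hook, and the forced-continuation loop are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: (1) add a dedicated "<|continue-thinking|>" token, (2) freeze all
# model weights except that token's embedding row, (3) apply budget forcing
# at inference by swapping "</think>" for the learned token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# 1) Register the new special token and grow the embedding matrix by one row.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|continue-thinking|>"]})
model.resize_token_embeddings(len(tokenizer))
continue_id = tokenizer.convert_tokens_to_ids("<|continue-thinking|>")

# 2) Freeze every weight, then re-enable gradients only on the input
#    embeddings; a hook zeroes the gradient of all rows except the new one,
#    so updates touch nothing but the learned token's embedding.
for p in model.parameters():
    p.requires_grad_(False)
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad_(True)

def _keep_only_new_row(grad: torch.Tensor) -> torch.Tensor:
    mask = torch.zeros_like(grad)
    mask[continue_id] = 1.0
    return grad * mask

embeddings.weight.register_hook(_keep_only_new_row)

# 3) Test-time budget forcing (inference only): whenever the model emits the
#    end-of-thinking token, replace it with the learned token and resume
#    generation, forcing another round of reasoning.
end_think_id = tokenizer.convert_tokens_to_ids("</think>")
input_ids = tokenizer("Solve: 17 * 24 = ?", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(2):  # number of forced continuations is a hyperparameter
        output = model.generate(input_ids, max_new_tokens=512,
                                eos_token_id=end_think_id, do_sample=False)
        if output[0, -1].item() == end_think_id:
            output[0, -1] = continue_id  # swap "</think>" for the learned token
        input_ids = output
```

In the paper's framing, only the single embedding row in step 2 is updated (via reinforcement learning on task reward), while step 3 mirrors the fixed-token "Wait" baseline with the learned token substituted in.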