Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Abstract

Test-time scaling has emerged as an effective approach for improving language model performance by spending additional compute at inference time. Recent studies have shown that overriding the end-of-thinking token (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves higher accuracy on standard math benchmarks than both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, in cases where the fixed-token approach improves the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
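The budget-forcing idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `budget_forced_decode`, `toy_model`, and the `max_forced` and `max_tokens` parameters are hypothetical names chosen for illustration, and the toy model stands in for a real language model. The sketch only shows the decoding-time substitution of the end-of-thinking token; in the paper's setting the substituted token would be the learned "<|continue-thinking|>" token, whose embedding is trained separately (not shown here).

```python
from typing import Callable, List

END_THINK = "</think>"               # end-of-thinking token named in the abstract
CONTINUE = "<|continue-thinking|>"   # continue token named in the abstract


def budget_forced_decode(
    next_token: Callable[[List[str]], str],  # hypothetical stand-in for a model step
    prompt: List[str],
    max_forced: int = 2,                     # assumed cap on forced continuations
    max_tokens: int = 64,                    # assumed cap on generated tokens
    continue_token: str = CONTINUE,
) -> List[str]:
    """Decode token by token, overriding early attempts to stop thinking."""
    tokens = list(prompt)
    forced = 0
    while len(tokens) - len(prompt) < max_tokens:
        tok = next_token(tokens)
        if tok == END_THINK and forced < max_forced:
            # Override the end-of-thinking token so reasoning continues.
            tokens.append(continue_token)
            forced += 1
            continue
        tokens.append(tok)
        if tok == END_THINK:
            break  # the forcing budget is used up, so thinking ends here
    return tokens


def toy_model(tokens: List[str]) -> str:
    """Toy stand-in: emits one reasoning step, then tries to stop."""
    return END_THINK if tokens[-1] == "step" else "step"


if __name__ == "__main__":
    print(budget_forced_decode(toy_model, ["<think>"], max_forced=2))
```

Passing `continue_token="Wait"` would correspond to the fixed-token baseline described above; the only difference at decoding time is which token replaces "</think>".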
