Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Abstract

Test-time scaling has emerged as an effective approach for improving language model performance by using additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves higher accuracy on standard math benchmarks than both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, in cases where the fixed-token approach improves on the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
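
The budget-forcing mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `budget_forced_decode`, `toy_model`, `max_forced`, and `max_tokens` are hypothetical names, the toy model is a stand-in for a real language model, and the number of forced continuations is an assumption the abstract does not specify. The sketch only shows the control flow in which an end-of-thinking token is overridden by a continuation token, either the fixed "Wait" or the learned "<|continue-thinking|>".

```python
from typing import Callable, List

END_THINK = "</think>"
CONTINUE = "<|continue-thinking|>"  # learned token; pass "Wait" for the fixed-token baseline


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    continue_token: str = CONTINUE,
    max_forced: int = 2,    # assumed budget: how many times to override </think>
    max_tokens: int = 64,   # assumed safety cap on generated tokens
) -> List[str]:
    """Decode token by token, overriding the first `max_forced` end-of-thinking tokens."""
    tokens = list(prompt)
    forced = 0
    while len(tokens) - len(prompt) < max_tokens:
        tok = next_token(tokens)
        if tok == END_THINK and forced < max_forced:
            # Replace the end-of-thinking token so the model keeps reasoning.
            tokens.append(continue_token)
            forced += 1
            continue
        tokens.append(tok)
        if tok == END_THINK:
            break  # the sketch stops once thinking is allowed to end
    return tokens


def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for a model: emits two 'step' tokens, then tries to stop."""
    steps_since_boundary = 0
    for t in reversed(context):
        if t != "step":
            break
        steps_since_boundary += 1
    return END_THINK if steps_since_boundary >= 2 else "step"


if __name__ == "__main__":
    print(budget_forced_decode(toy_model, ["<think>"]))
    print(budget_forced_decode(toy_model, ["<think>"], continue_token="Wait", max_forced=1))
```

In this sketch the fixed-token and learned-token variants differ only in which string is appended. In the method the abstract describes, the difference lies in the embedding behind that token, which is trained with reinforcement learning while the rest of the model stays frozen.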
