Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Abstract

Training models to use test-time compute effectively is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or by running RL with a 0/1 outcome reward, but do these approaches use test-time compute efficiently? Would they continue to scale as the budget increases? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time, and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome reward used in RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
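To make the "progress" bonus and the regret measure concrete, the following is a minimal sketch, not the paper's implementation. It assumes the output stream has already been split into blocks and that an estimate of the probability of eventual success is available after each block. The function names, the weight `alpha`, and the choice of a comparator that always succeeds (`best=1.0`) are illustrative assumptions; regret is summed over blocks here rather than over tokens, for simplicity.

```python
from typing import List


def progress_bonuses(success_probs: List[float]) -> List[float]:
    """Progress of each block: change in estimated success probability.

    success_probs[0] is the estimate before any block is produced;
    success_probs[j] is the estimate after the first j blocks.
    """
    return [
        success_probs[j] - success_probs[j - 1]
        for j in range(1, len(success_probs))
    ]


def cumulative_regret(success_probs: List[float], best: float = 1.0) -> float:
    """Sum of gaps to a comparator after each block.

    Assumption: the comparator succeeds with probability `best`.
    """
    return sum(best - p for p in success_probs[1:])


def block_rewards(
    outcome: int, success_probs: List[float], alpha: float = 0.5
) -> List[float]:
    """Dense per-block rewards: alpha * progress, plus the 0/1 outcome
    reward added to the final block. `alpha` is a hypothetical weight."""
    rewards = [alpha * b for b in progress_bonuses(success_probs)]
    if rewards:
        rewards[-1] += outcome
    return rewards


if __name__ == "__main__":
    # Made-up estimates for illustration: the second block loses ground.
    probs = [0.2, 0.5, 0.4, 0.9]
    print(progress_bonuses(probs))
    print(cumulative_regret(probs))
    print(block_rewards(outcome=1, success_probs=probs))
```

In this toy trace the second block has negative progress, so it receives a negative dense reward even though the final answer is correct. A 0/1 outcome reward alone cannot express this distinction between blocks.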
