Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Abstract

Prevalent reinforcement learning (RL) methods for fine-tuning large language model (LLM) reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
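To make the role of verification in parallel sampling concrete, the following is a minimal sketch of one common way verifier scores can be combined with sampled answers: score-weighted voting, compared against plain majority voting. The abstract does not specify the aggregation rule, so this is an assumption for illustration; the function names `majority_vote` and `weighted_vote`, the answer strings, and the scores are all hypothetical.

```python
from collections import defaultdict

def majority_vote(samples):
    """Pick the most frequent final answer, ignoring verifier scores."""
    counts = defaultdict(int)
    for answer, _score in samples:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_vote(samples):
    """Pick the answer with the largest summed verifier score.

    `samples` is a list of (final_answer, verifier_score) pairs, where the
    score is assumed to be the verifier's estimated probability in [0, 1]
    that the solution is correct.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical scores for five parallel samples of one problem.
samples = [("42", 0.9), ("41", 0.2), ("41", 0.3), ("42", 0.8), ("41", 0.1)]

print(majority_vote(samples))  # the more frequent answer
print(weighted_vote(samples))  # the answer the verifier trusts more
```

In this toy input the two rules disagree: "41" appears more often, but "42" accumulates a higher total score. A verifier can add value over sampling alone in exactly this kind of case.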
