Synthetic Data RL: Task Definition Is All You Need

Abstract

Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that fine-tunes models with RL using only synthetic data generated from a task definition. Our method first generates question-answer pairs from the task definition and retrieved documents, then adapts question difficulty based on how well the model can solve each question, and finally selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves an absolute improvement of 29.2% over the base model on GSM8K (+2.9 pp vs. the instruction-tuned model, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law), and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves performance on GSM8K by only 0.4 pp, indicating limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
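
The pass-rate-based selection step can be illustrated with a minimal sketch. The abstract does not give the selection rule, thresholds, or sample counts, so everything here is an assumption made for illustration: the function names `pass_rate` and `select_questions`, the `low`/`high` bounds, and the choice to keep only questions the model solves sometimes but not always. It is not the authors' implementation; see the linked repository for that.

```python
from typing import Dict, List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled attempts that produced the correct answer."""
    if not outcomes:
        raise ValueError("need at least one sampled attempt")
    return sum(outcomes) / len(outcomes)


def select_questions(
    attempts: Dict[str, Sequence[bool]],
    low: float = 0.0,   # hypothetical lower bound, not taken from the paper
    high: float = 1.0,  # hypothetical upper bound, not taken from the paper
) -> List[str]:
    """Keep questions whose average pass rate lies strictly between
    `low` and `high`, ordered from hardest to easiest."""
    rates = {q: pass_rate(o) for q, o in attempts.items()}
    kept = [q for q, r in rates.items() if low < r < high]
    return sorted(kept, key=lambda q: rates[q])


# Toy data: each list holds correctness flags for several sampled answers.
attempts = {
    "q_easy": [True, True, True, True],          # always solved
    "q_mid": [True, False, True, False],         # solved half the time
    "q_hard": [False, False, False, True],       # rarely solved
    "q_unsolved": [False, False, False, False],  # never solved
}
selected = select_questions(attempts)
print(selected)  # ['q_hard', 'q_mid']
```

Under these assumed bounds, questions that are always or never solved are dropped, on the reasoning that they would give little learning signal; a real pipeline might use a different band or ranking.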
