Synthetic Data RL: Task Definition Is All You Need
May 18, 2025
Authors: Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
cs.AI
Abstract
Reinforcement learning (RL) is a powerful way to adapt foundation models to
specialized tasks, but its reliance on large-scale human-labeled data limits
broad adoption. We introduce Synthetic Data RL, a simple and general framework
that reinforcement fine-tunes models using only synthetic data generated from a
task definition. Our method first generates question and answer pairs from the
task definition and retrieved documents, then adapts each question's difficulty
based on model solvability, and selects questions for RL training according to
the model's average pass rate across samples. On Qwen-2.5-7B, our
method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9
pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on
GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA
(finance). It surpasses supervised fine-tuning under the same data budget and
nearly matches RL with full human data across datasets (e.g., +17.2 pp on
GSM8K). Adding 100 human demonstrations improves performance on GSM8K by only
0.4 pp, showing limited added value. By reducing human data annotation,
Synthetic Data RL enables scalable and efficient RL-based model adaptation.
Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
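The question-selection step described in the abstract (keeping questions by the model's average pass rate across samples) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_fn`, `k_samples`, and the pass-rate thresholds `low`/`high` are hypothetical names and values chosen for clarity.

```python
def average_pass_rate(sampled_answers, gold):
    """Fraction of the model's sampled answers that match the gold answer."""
    return sum(a == gold for a in sampled_answers) / len(sampled_answers)

def select_questions(questions, sample_fn, k_samples=8, low=0.2, high=0.8):
    """Keep questions of intermediate difficulty for RL training: those the
    current model neither always solves nor always fails. k_samples, low,
    and high are illustrative hyperparameters, not values from the paper."""
    selected = []
    for q in questions:
        # Sample k answers from the model for this question.
        answers = [sample_fn(q["question"]) for _ in range(k_samples)]
        rate = average_pass_rate(answers, q["answer"])
        if low <= rate <= high:
            selected.append(q)
    return selected
```

Here `sample_fn` stands in for sampling a completion from the base model; filtering out questions with pass rates near 0 or 1 keeps only examples whose reward signal can actually change during RL fine-tuning.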