Synthetic Data RL: Task Definition Is All You Need
May 18, 2025
Authors: Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
cs.AI
Abstract
Reinforcement learning (RL) is a powerful way to adapt foundation models to
specialized tasks, but its reliance on large-scale human-labeled data limits
broad adoption. We introduce Synthetic Data RL, a simple and general framework
that reinforcement fine-tunes models using only synthetic data generated from a
task definition. Our method first generates question and answer pairs from the
task definition and retrieved documents, then adapts the difficulty of each
question based on model solvability, and selects questions using the average
pass rate of the model across samples for RL training. On Qwen-2.5-7B, our
method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9
pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on
GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA
(finance). It surpasses supervised fine-tuning under the same data budget and
nearly matches RL with full human data across datasets (e.g., +17.2 pp on
GSM8K). Adding 100 human demonstrations improves performance on GSM8K by only
0.4 pp, showing limited added value. By reducing human data annotation,
Synthetic Data RL enables scalable and efficient RL-based model adaptation.
Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
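The selection step described in the abstract — sampling the model several times per synthetic question and keeping questions by average pass rate — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sample count `k` and the pass-rate band (`low`, `high`) are hypothetical parameters, and `sample_fn` stands in for however the model is queried.

```python
def pass_rate(model_answers, reference):
    """Fraction of sampled model answers matching the reference answer."""
    return sum(a == reference for a in model_answers) / len(model_answers)

def select_questions(questions, sample_fn, k=8, low=0.1, high=0.9):
    """Keep questions the model solves sometimes but not always.

    questions  -- list of dicts with "question" and "answer" keys
    sample_fn  -- callable: question text -> list of sampled model answers
    k          -- number of samples used to estimate the pass rate
    low, high  -- keep questions whose pass rate is strictly inside (low, high);
                  trivially solved or unsolvable questions carry little RL signal

    Returns a list of (question_dict, pass_rate) pairs selected for RL training.
    """
    selected = []
    for q in questions:
        rate = pass_rate(sample_fn(q["question"])[:k], q["answer"])
        if low < rate < high:
            selected.append((q, rate))
    return selected
```

The band filter reflects the intuition in the abstract: questions the model always passes or always fails contribute near-zero advantage during RL, so mid-difficulty questions are the ones worth training on.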