Synthetic Data RL: Task Definition Is All You Need

Abstract

Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question-answer pairs from the task definition and retrieved documents, then adapts question difficulty based on model solvability, and finally selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law), and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves GSM8K performance by only 0.4 pp, indicating limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
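The selection step, which picks questions by the model's average pass rate across samples, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Question`, `pass_rate`, and `select_for_rl`, the exact-match grading, the sample count, and the `low`/`high` thresholds are all hypothetical choices made for illustration. The abstract does not state the actual selection rule, so the band filter below is only one plausible reading.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Question:
    text: str    # synthetic question
    answer: str  # synthetic reference answer


def pass_rate(sampled_answers: List[str], reference: str) -> float:
    """Fraction of sampled answers that exactly match the reference (after stripping)."""
    if not sampled_answers:
        return 0.0
    correct = sum(1 for a in sampled_answers if a.strip() == reference.strip())
    return correct / len(sampled_answers)


def select_for_rl(
    questions: List[Question],
    sample_fn: Callable[[str], str],  # hypothetical: returns one model answer per call
    n_samples: int = 8,               # assumed number of samples per question
    low: float = 0.1,                 # assumed lower pass-rate threshold
    high: float = 0.9,                # assumed upper pass-rate threshold
) -> List[Tuple[Question, float]]:
    """Keep questions whose average pass rate lies in [low, high], hardest first."""
    selected = []
    for q in questions:
        answers = [sample_fn(q.text) for _ in range(n_samples)]
        rate = pass_rate(answers, q.answer)
        # Assumption: questions the model always or never solves are dropped.
        if low <= rate <= high:
            selected.append((q, rate))
    # Sort by ascending pass rate so harder (but solvable) questions come first.
    selected.sort(key=lambda pair: pair[1])
    return selected
```

In practice `sample_fn` would wrap a call to the model being fine-tuned, and the grading function would likely need task-specific answer normalization rather than exact string matching.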
