CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Abstract

Curriculum learning plays a crucial role in improving the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to account adequately for variations in prompt difficulty, or they rely on simplistic filtering mechanisms that select prompts within a narrow criterion range, which wastes significant computation. In this work, we approach the problem from the perspective of gradient optimization in reinforcement learning and offer a systematic, theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors that influence training efficiency: the selection of training prompts and the allocation of rollout quantities across prompts. Our theoretical analysis shows that the sampling distribution over prompts dictates the convergence rate of gradient descent, while the allocation of rollout quantities influences the consistency and stability of the overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and uses Bayesian posterior estimation to minimize computational overhead. Experiments show that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points with a 1.5B model and by +4.82 points with a 7B model. CurES also converges faster than the baselines, including GRPO.
