CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

October 1, 2025
Authors: Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, Jun Wang
cs.AI

Abstract

Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points and +4.82 points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
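The abstract does not spell out the implementation, but the described mechanism (a per-prompt difficulty estimate maintained via Bayesian posterior estimation, then used to choose training prompts and split the rollout budget) can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration: the class and function names (PromptStats, build_batch), the Beta posterior, and the p(1-p) weighting rule are not taken from the paper, which may use a different estimator and sampling distribution.

```python
import random

class PromptStats:
    """Beta posterior over a prompt's success probability.
    Assumed mechanism for illustration, not the paper's exact estimator."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of successful rollouts
        self.beta = beta    # pseudo-count of failed rollouts

    @property
    def p_hat(self):
        # Posterior mean success rate of this prompt under the current policy.
        return self.alpha / (self.alpha + self.beta)

    def update(self, successes, failures):
        # Bayesian update from the latest batch of rollout outcomes.
        self.alpha += successes
        self.beta += failures

def sampling_weight(p):
    # Heuristic: prompts of intermediate difficulty (p near 0.5) carry the most
    # gradient signal in group-relative methods, so weight them higher.
    return p * (1.0 - p) + 1e-6

def build_batch(stats, batch_size, rollout_budget):
    """Pick prompts and split the rollout budget according to posterior difficulty."""
    weights = [sampling_weight(s.p_hat) for s in stats]
    # Sample prompt indices for this step (with replacement, for simplicity).
    idxs = random.choices(range(len(stats)), weights=weights, k=batch_size)
    # Allocate rollouts to the chosen prompts in proportion to their weight.
    chosen_w = [weights[i] for i in idxs]
    total = sum(chosen_w)
    rollouts = [max(1, round(rollout_budget * w / total)) for w in chosen_w]
    return list(zip(idxs, rollouts))

# Toy usage: 100 prompts, 32 prompts per step, 256 rollouts per step.
stats = [PromptStats() for _ in range(100)]
batch = build_batch(stats, batch_size=32, rollout_budget=256)
for prompt_idx, n_rollouts in batch:
    # In a real trainer these outcomes would come from sampling the policy and
    # scoring the completions; here we fake binary success/failure counts.
    successes = random.randint(0, n_rollouts)
    stats[prompt_idx].update(successes, n_rollouts - successes)
```

The sketch keeps the overhead low in the spirit of the abstract: difficulty estimates are updated from rollouts the trainer already produces, so no extra evaluation passes are needed.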