
CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

October 1, 2025
作者: Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, Jun Wang
cs.AI

Abstract

Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points and +4.82 points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
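To make the abstract's two levers concrete, here is a minimal, illustrative sketch of Bayesian posterior estimation of per-prompt pass rates and uncertainty-weighted prompt selection. All function names are hypothetical, the Beta-Bernoulli model is a standard choice for binary rollout outcomes, and the variance-based weighting is an assumed heuristic proxy, not the allocation rule derived in the paper.

```python
import random

def update_posterior(alpha, beta, successes, failures):
    """Beta-Bernoulli conjugate update for one prompt's pass rate,
    given rollout outcomes (successes, failures) under the current policy."""
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    """Posterior-mean estimate of the prompt's pass rate."""
    return alpha / (alpha + beta)

def sampling_weight(alpha, beta):
    """Heuristic curriculum weight: posterior variance of the pass rate.

    Prompts whose difficulty is most uncertain (often near p = 0.5,
    i.e. neither trivially solved nor hopeless) get more rollouts.
    This is an illustrative proxy for difficulty-aware allocation,
    not CurES's exact criterion."""
    n = alpha + beta
    p = alpha / n
    return p * (1 - p) / (n + 1)

def select_prompts(posteriors, k):
    """Sample k prompt ids proportional to their curriculum weights."""
    ids = list(posteriors)
    weights = [sampling_weight(*posteriors[i]) for i in ids]
    return random.choices(ids, weights=weights, k=k)
```

Under this sketch, a prompt with few observations (e.g. one success, one failure) carries more weight than a prompt already known to be mostly solved, so rollout budget concentrates where gradient information is least redundant.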