CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Abstract

Curriculum learning plays a crucial role in improving the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to account adequately for variations in prompt difficulty, or rely on simplistic filtering mechanisms that select prompts within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of gradient optimization in reinforcement learning, offering a systematic theoretical investigation of how to improve the training efficiency of LLMs. We identify two key factors that influence training efficiency: the selection of training prompts and the allocation of rollout quantities across prompts. Our theoretical analysis shows that the sampling distribution over prompts determines the convergence rate of gradient descent, while the allocation of rollout quantities influences the consistency and stability of the overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and uses Bayesian posterior estimation to minimize computational overhead. Experiments show that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 and +4.82 points with 1.5B and 7B models, respectively. CurES also converges faster than the baselines, including GRPO.
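The abstract names two levers, prompt sampling and rollout allocation, and a Bayesian posterior estimate of prompt difficulty, but gives no formulas. The sketch below is only an illustration of how such pieces could fit together, not the CurES algorithm itself. Everything specific in it is an assumption: the Beta-Bernoulli posterior over a prompt's success rate, the uniform Beta(1, 1) prior, the use of the reward variance p(1 - p) as a proxy for how informative a prompt is, and the proportional allocation rule. The names `PromptStats`, `sampling_weights`, and `allocate_rollouts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptStats:
    """Beta(alpha, beta) posterior over one prompt's success probability.

    Assumption: rollouts are scored correct/incorrect, so a Beta-Bernoulli
    model applies. The Beta(1, 1) prior is an illustrative choice.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> None:
        # Conjugate update: add observed counts to the pseudo-counts.
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def sampling_weights(stats: list[PromptStats]) -> list[float]:
    """Normalize p(1 - p) scores into a sampling distribution over prompts.

    Assumption: Bernoulli reward variance stands in for training signal;
    prompts the model always solves or always fails score near zero.
    """
    scores = [s.mean * (1.0 - s.mean) for s in stats]
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(stats)] * len(stats)
    return [x / total for x in scores]


def allocate_rollouts(weights: list[float], budget: int, minimum: int = 1) -> list[int]:
    """Split a rollout budget in proportion to weights, with a per-prompt floor."""
    n = len(weights)
    if budget < minimum * n:
        raise ValueError("budget too small for the per-prompt minimum")
    spare = budget - minimum * n
    raw = [w * spare for w in weights]
    counts = [minimum + int(r) for r in raw]
    # Hand out the rollouts lost to truncation, largest fractional part first.
    leftover = max(0, budget - sum(counts))
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

Under these assumptions, a prompt solved about half the time receives the largest sampling weight and the largest share of rollouts, while prompts that are nearly always solved or nearly always failed receive only the floor plus a small share.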
