VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

September 24, 2025
Authors: Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
cs.AI

Abstract

Policy-based reinforcement learning currently plays an important role in improving the mathematical reasoning ability of LLMs. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly account for LLMs' learning ability on samples of different difficulty levels, which runs contrary to the human cognitive process of mastering mathematical reasoning from easy to hard. Intuitively, we find that the variance of the rollout group's rewards in RLVR partly reflects the difficulty of the current sample for the LLM: samples that are too easy or too difficult have lower variance, while samples of moderate difficulty have higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models demonstrate the advantages of VCRL over existing LLM RL baselines.
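To make the variance-based difficulty signal concrete, the following minimal Python sketch ranks prompts by the variance of their rollout-group rewards and keeps the high-variance (moderate-difficulty) ones for the next update. This is an illustrative assumption of how such a curriculum filter could look, not the paper's implementation; the helper names and the `keep_ratio` knob are hypothetical.

```python
import numpy as np

def group_reward_variance(rewards):
    """Variance of verifiable rewards across one prompt's rollout group.

    Low variance -> the sample is too easy (all rollouts correct) or too
    hard (all rollouts wrong); high variance -> moderate difficulty.
    """
    return float(np.var(rewards))

def select_curriculum_batch(prompts, grouped_rewards, keep_ratio=0.5):
    """Rank prompts by group-reward variance and keep the highest-variance
    ones for training. `keep_ratio` is a hypothetical knob for illustration.
    """
    scored = sorted(
        zip(prompts, grouped_rewards),
        key=lambda pr: group_reward_variance(pr[1]),
        reverse=True,
    )
    n_keep = max(1, int(len(scored) * keep_ratio))
    return [prompt for prompt, _ in scored[:n_keep]]

# Example: binary verifiable rewards for 4 prompts, 8 rollouts each.
prompts = ["p1", "p2", "p3", "p4"]
grouped_rewards = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # too easy: variance 0
    [0, 0, 0, 0, 0, 0, 0, 0],  # too hard: variance 0
    [1, 0, 1, 0, 1, 1, 0, 0],  # moderate: variance 0.25
    [1, 1, 1, 0, 1, 1, 1, 1],  # fairly easy: variance ~0.11
]
print(select_curriculum_batch(prompts, grouped_rewards, keep_ratio=0.5))
# -> ['p3', 'p4']
```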