VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
September 24, 2025
Authors: Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
cs.AI
Abstract
Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly account for an LLM's ability to learn from samples of different difficulty levels, which runs contrary to the human cognitive process of approaching mathematical reasoning tasks from easy to difficult. Intuitively, we observe that the variance of a rollout group's rewards in RLVR partly reflects how difficult the current sample is for the LLM: samples that are too easy or too difficult yield low variance, while samples of moderate difficulty yield high variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models demonstrate the advantages of VCRL over current LLM RL baselines.
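The core intuition, that the variance of verifiable 0/1 rewards within a rollout group peaks for moderately difficult samples and vanishes for samples that are trivially easy or far too hard, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; the function names (group_reward_variance, select_by_variance) and the simple top-k selection rule are hypothetical stand-ins for VCRL's variance-based difficulty control.

```python
import numpy as np

def group_reward_variance(rewards):
    """Variance of verifiable rewards across one rollout group.

    For 0/1 rewards, variance peaks when roughly half of the rollouts
    succeed (moderate difficulty) and drops to 0 when the sample is
    trivially easy or far too hard for the current policy.
    """
    r = np.asarray(rewards, dtype=float)
    return float(r.var())

def select_by_variance(batch_rewards, top_k):
    """Keep the top_k samples whose rollout groups show the highest
    reward variance. This greedy top-k rule is an illustrative
    assumption, not the paper's exact curriculum schedule.
    """
    variances = [group_reward_variance(r) for r in batch_rewards]
    order = np.argsort(variances)[::-1]
    return [int(i) for i in order[:top_k]]

if __name__ == "__main__":
    # Three prompts, each with 8 rollouts scored by a verifier (1 = correct).
    batch = [
        [1, 1, 1, 1, 1, 1, 1, 1],  # too easy  -> variance 0
        [0, 0, 0, 0, 0, 0, 0, 0],  # too hard  -> variance 0
        [1, 0, 1, 0, 0, 1, 1, 0],  # moderate  -> highest variance
    ]
    print(select_by_variance(batch, top_k=1))  # -> [2]
```

In this toy example only the moderately difficult prompt (index 2) is retained for the policy update, mirroring the idea of focusing training on samples whose group reward variance indicates they sit at the edge of the model's current ability.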