VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
September 24, 2025
Authors: Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
cs.AI
Abstract
Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly account for an LLM's ability to learn from samples of different difficulty levels, which runs contrary to the human cognitive process of approaching mathematical reasoning tasks from easy to difficult. Intuitively, we observe that the variance of a rollout group's rewards in RLVR partly reflects how difficult the current sample is for the LLM: samples that are too easy or too difficult yield low variance, while samples of moderate difficulty yield high variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models demonstrate the advantages of VCRL over current LLM RL baselines.
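The core intuition, that the variance of verifiable 0/1 rewards within a rollout group peaks for moderately difficult samples and vanishes for samples that are trivially easy or far too hard, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; the function names (group_reward_variance, select_by_variance) and the simple top-k selection rule are hypothetical stand-ins for VCRL's variance-based difficulty control.

```python
import numpy as np

def group_reward_variance(rewards):
    """Variance of verifiable rewards across one rollout group.

    For 0/1 rewards, variance peaks when roughly half of the rollouts
    succeed (moderate difficulty) and drops to 0 when the sample is
    trivially easy or far too hard for the current policy.
    """
    r = np.asarray(rewards, dtype=float)
    return float(r.var())

def select_by_variance(batch_rewards, top_k):
    """Keep the top_k samples whose rollout groups show the highest
    reward variance. This greedy top-k rule is an illustrative
    assumption, not the paper's exact curriculum schedule.
    """
    variances = [group_reward_variance(r) for r in batch_rewards]
    order = np.argsort(variances)[::-1]
    return [int(i) for i in order[:top_k]]

if __name__ == "__main__":
    # Three prompts, each with 8 rollouts scored by a verifier (1 = correct).
    batch = [
        [1, 1, 1, 1, 1, 1, 1, 1],  # too easy  -> variance 0
        [0, 0, 0, 0, 0, 0, 0, 0],  # too hard  -> variance 0
        [1, 0, 1, 0, 0, 1, 1, 0],  # moderate  -> highest variance
    ]
    print(select_by_variance(batch, top_k=1))  # -> [2]
```

In this toy example only the moderately difficult prompt (index 2) is retained for the policy update, mirroring the idea of focusing training on samples whose group reward variance indicates they sit at the edge of the model's current ability.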