VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Abstract

Policy-based reinforcement learning currently plays an important role in improving LLMs (large language models) on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) do not explicitly account for how well an LLM can learn from samples of different difficulty levels. This runs contrary to the human cognitive process for mathematical reasoning tasks, which progresses from easy to difficult. Intuitively, we find that the variance of the rollout group's rewards in RLVR (reinforcement learning with verifiable rewards) partly reflects how difficult the current sample is for the LLM: samples that are too easy or too difficult have a lower variance, while samples of moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models show the advantages of VCRL over current LLM RL baselines.
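The variance intuition can be illustrated with a small sketch. It assumes binary verifiable rewards (1 for a correct rollout, 0 otherwise), which the abstract does not state; under that assumption a group with success rate p has reward variance p(1 - p), which is zero when every rollout succeeds or fails and largest at p = 0.5. The function names, the example groups, and the threshold rule below are hypothetical illustrations, not the selection rule used by VCRL.

```python
def group_reward_variance(rewards):
    """Population variance of the rewards in one rollout group."""
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)


def select_by_variance(groups, threshold):
    """Keep the prompts whose group reward variance reaches `threshold`.

    `groups` maps a prompt id to the list of rewards of its rollouts.
    The fixed threshold is an illustrative assumption only.
    """
    return [
        prompt_id
        for prompt_id, rewards in groups.items()
        if group_reward_variance(rewards) >= threshold
    ]


# Hypothetical rollout groups with 8 rollouts each (assumed binary rewards).
groups = {
    "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],  # all correct: no variance
    "moderate": [1, 0, 1, 0, 1, 0, 1, 0],  # half correct: highest variance
    "too_hard": [0, 0, 0, 0, 0, 0, 0, 1],  # almost all wrong: low variance
}

selected = select_by_variance(groups, threshold=0.2)
```

With these made-up groups, only the moderate-difficulty prompt passes the filter, which mirrors the claim that moderately difficult samples show the highest reward variance.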
