VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Abstract

Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) do not explicitly account for how well an LLM can learn from samples of different difficulty levels, which runs contrary to the easy-to-difficult progression that humans follow when learning mathematical reasoning. We observe that, in RLVR (reinforcement learning with verifiable rewards), the variance of the rewards within a rollout group partly reflects how difficult the current sample is for the LLM. Samples that are too easy or too difficult have low variance, because nearly all rollouts succeed or nearly all fail, while samples of moderate difficulty have higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models show that VCRL has advantages over current LLM RL baselines.
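
The variance intuition above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary (0/1) verifiable rewards, uses the population variance, and the normalization by 0.25, the `threshold` value, and the function names are all hypothetical choices made only for illustration. For binary rewards with success rate p, the variance is p(1 - p), which is zero at p = 0 or p = 1 and peaks at 0.25 when p = 0.5.

```python
def group_reward_variance(rewards):
    """Population variance of the rewards in one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    return sum((r - mean) ** 2 for r in rewards) / n


def normalized_variance(rewards):
    """Variance scaled to [0, 1] for binary rewards.

    Assumption: rewards are 0 or 1, so the maximum variance is 0.25.
    """
    return group_reward_variance(rewards) / 0.25


def select_moderate(groups, threshold):
    """Keep prompts whose normalized reward variance reaches the threshold.

    Hypothetical selection rule; the abstract does not specify one.
    """
    return [
        name
        for name, rewards in groups.items()
        if normalized_variance(rewards) >= threshold
    ]


# Toy rollout groups: 8 rollouts per prompt, reward 1 = verified correct.
groups = {
    "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],
    "moderate": [1, 0, 1, 0, 1, 0, 1, 0],
    "too_hard": [0, 0, 0, 0, 0, 0, 0, 0],
    "mostly_solved": [1, 1, 1, 1, 1, 1, 1, 0],
}

for name, rewards in groups.items():
    print(name, group_reward_variance(rewards))

print(select_moderate(groups, threshold=0.5))
```

In this toy data, the all-correct and all-wrong groups have zero variance, the half-correct group reaches the maximum of 0.25, and only that group passes a threshold of 0.5 on the normalized variance.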
