VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Abstract

Policy-based reinforcement learning currently plays an important role in improving the performance of large language models (LLMs) on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) do not explicitly account for how well an LLM can learn from samples of different difficulty levels. This runs counter to the way humans approach mathematical reasoning, progressing from easy problems to hard ones. We observe that, in reinforcement learning with verifiable rewards (RLVR), the variance of the rewards within a rollout group partly reflects how difficult the current sample is for the LLM: samples that are too easy or too difficult have lower variance, while samples of moderate difficulty have higher variance. Based on this observation, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models show the advantages of VCRL over current LLM RL baselines.
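The link between reward variance and difficulty can be illustrated with a small sketch. It assumes binary verifiable rewards (1.0 for a correct rollout, 0.0 otherwise), in which case a group with success rate p has reward variance p(1 - p), which is zero at p = 0 or p = 1 and peaks at p = 0.5. The function names, the normalization by 0.25, and the threshold value are illustrative assumptions; the abstract does not specify the exact selection rule used by VCRL.

```python
from statistics import pvariance

# Assumption: rewards lie in [0, 1], so the population variance is at most 0.25.
MAX_VARIANCE = 0.25


def normalized_variance(rewards):
    """Population variance of one rollout group's rewards, scaled to [0, 1]."""
    return pvariance(rewards) / MAX_VARIANCE


def select_by_variance(groups, threshold):
    """Return the prompts whose normalized reward variance meets the threshold.

    `groups` maps a prompt identifier to the list of rewards from its rollouts.
    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return [
        prompt
        for prompt, rewards in groups.items()
        if normalized_variance(rewards) >= threshold
    ]


# Three hypothetical prompts with eight rollouts each.
groups = {
    "too_easy": [1.0] * 8,                                   # every rollout correct
    "moderate": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0],    # half correct
    "too_hard": [0.0] * 8,                                   # every rollout wrong
}

selected = select_by_variance(groups, threshold=0.5)
```

Under these assumptions only the "moderate" prompt is kept, since the all-correct and all-wrong groups have zero variance and carry little signal about which rollouts to prefer.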
