FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
March 21, 2025
Authors: Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, Feng Zhang
cs.AI
Abstract
In this paper, we propose \textsc{FastCuRL}, a simple yet efficient
Curriculum Reinforcement Learning approach with a context window extension
strategy that improves the training efficiency of reinforcement learning for
R1-like reasoning models while enhancing their performance on complex
reasoning tasks with long chain-of-thought rationales, particularly with a
1.5B-parameter language model.
\textsc{FastCuRL} consists of two main procedures: length-aware
training data segmentation and context window extension training. Specifically,
the former first splits the original training data into three levels by input
prompt length, and the latter then trains the reasoning model on the segmented
datasets with a progressively increasing context window length. Experimental
results demonstrate that
\textsc{FastCuRL}-1.5B-Preview surpasses DeepScaleR-1.5B-Preview
across five benchmarks (MATH 500, AIME 2024, AMC 2023, Minerva Math, and
OlympiadBench) while using only 50\% of the training steps.
Furthermore, all training stages for \textsc{FastCuRL}-1.5B-Preview are completed using
just a single node with 8 GPUs.
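The two procedures described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: the length boundaries, the three-stage context window sizes, and the whitespace-based length proxy are all assumptions chosen for clarity.

```python
# Hedged sketch of FastCuRL's two procedures: (1) length-aware training
# data segmentation and (2) progressive context window extension training.
# Boundaries and window sizes below are illustrative assumptions, not
# values taken from the paper.

def segment_by_prompt_length(prompts, boundaries=(512, 1024)):
    """Split prompts into three levels by input prompt length.

    Length is approximated here by whitespace token count; a real
    pipeline would use the model's tokenizer.
    """
    lo, hi = boundaries
    levels = {"short": [], "medium": [], "long": []}
    for prompt in prompts:
        n = len(prompt.split())
        if n <= lo:
            levels["short"].append(prompt)
        elif n <= hi:
            levels["medium"].append(prompt)
        else:
            levels["long"].append(prompt)
    return levels


def curriculum_schedule(levels, context_windows=(8192, 16384, 24576)):
    """Pair each difficulty level with a progressively larger context
    window, yielding the stage order for curriculum RL training."""
    return list(zip(["short", "medium", "long"], context_windows))
```

Each `(level, context_window)` pair in the schedule would correspond to one RL training stage, with the model carried over from stage to stage as the window grows.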