FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
March 21, 2025
Authors: Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, Feng Zhang
cs.AI
Abstract
In this paper, we propose \textsc{FastCuRL}, a simple yet efficient
Curriculum Reinforcement Learning approach with a context window extension
strategy that improves the training efficiency of reinforcement learning for
R1-like reasoning models while enhancing their performance on complex
reasoning tasks with long chain-of-thought rationales, particularly with a
1.5B-parameter language model.
\textsc{FastCuRL} consists of two main procedures: length-aware
training data segmentation and context window extension training. Specifically,
the former first splits the original training data into three levels by input
prompt length, and the latter then trains the reasoning model on the segmented
datasets with a progressively increasing context window length. Experimental
results demonstrate that
\textsc{FastCuRL}-1.5B-Preview surpasses DeepScaleR-1.5B-Preview
across five benchmarks (MATH 500, AIME 2024, AMC 2023, Minerva Math, and
OlympiadBench) while using only 50\% of the training steps.
Furthermore, all training stages for \textsc{FastCuRL}-1.5B-Preview are completed using
just a single node with 8 GPUs.
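The two procedures described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: the length boundaries, the three-stage context window sizes, and the whitespace-based length proxy are all assumptions chosen for clarity.

```python
# Hedged sketch of FastCuRL's two procedures: (1) length-aware training
# data segmentation and (2) progressive context window extension training.
# Boundaries and window sizes below are illustrative assumptions, not
# values taken from the paper.

def segment_by_prompt_length(prompts, boundaries=(512, 1024)):
    """Split prompts into three levels by input prompt length.

    Length is approximated here by whitespace token count; a real
    pipeline would use the model's tokenizer.
    """
    lo, hi = boundaries
    levels = {"short": [], "medium": [], "long": []}
    for prompt in prompts:
        n = len(prompt.split())
        if n <= lo:
            levels["short"].append(prompt)
        elif n <= hi:
            levels["medium"].append(prompt)
        else:
            levels["long"].append(prompt)
    return levels


def curriculum_schedule(levels, context_windows=(8192, 16384, 24576)):
    """Pair each difficulty level with a progressively larger context
    window, yielding the stage order for curriculum RL training."""
    return list(zip(["short", "medium", "long"], context_windows))
```

Each `(level, context_window)` pair in the schedule would correspond to one RL training stage, with the model carried over from stage to stage as the window grows.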