

RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

November 6, 2025
Authors: Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu
cs.AI

Abstract

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
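For readers who want the control flow at a glance, below is a minimal Python sketch of the explore-filter-refine cycle the abstract describes. The helper callables `rl_train`, `is_correct`, and `rft_finetune` are hypothetical stand-ins for the RLVR trainer, the verifiable-reward check, and rejection-sampling fine-tuning; they are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the RLoop cycle: RL exploration, trajectory filtering,
# and rejection-sampling fine-tuning (RFT) as iterative policy re-initialization.
# The helper callables are hypothetical placeholders, not a real API.

def rloop(policy, prompts, rl_train, is_correct, rft_finetune, iterations=3):
    """Alternate RL exploration and RFT exploitation over several rounds."""
    for _ in range(iterations):
        # Explore: run RL from the current initialization and collect the
        # trajectories generated across training steps, not just the final policy.
        trajectories = rl_train(policy, prompts)

        # Filter: keep only trajectories that pass the verifiable reward,
        # preserving the diverse successful solutions found during training.
        expert_data = [traj for traj in trajectories if is_correct(traj)]

        # Exploit: fine-tune the starting policy on the expert data (RFT),
        # producing a stronger initialization for the next iteration.
        policy = rft_finetune(policy, expert_data)

    return policy
```

The key design choice the abstract emphasizes is that fine-tuning targets the iteration's starting policy rather than the RL-trained endpoint, which is how transient policy diversity is consolidated instead of being overwritten by continued optimization.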