

ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

February 3, 2026
Authors: Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, Weinan Zhang
cs.AI

Abstract

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training and uses high-quality corpora under a rapidly decaying learning rate. Building on this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous, self-reinforcing evolution of LLMs.
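The abstract describes using an RL-tuned model's reasoning priors to reweight tokens in the mid-training loss, but does not specify the weighting scheme. The sketch below illustrates one plausible instantiation, assuming per-token log-probabilities are available from both the RL-tuned and base models; the log-prob-gap softmax, the temperature, and the average-to-one normalization are illustrative assumptions, not the paper's actual formulation.

```python
import math

def token_weights(logp_rl, logp_base, temperature=1.0):
    """Hypothetical weighting: tokens that the RL-tuned model assigns a
    larger log-probability gap over the base model are treated as more
    pivotal for reasoning and receive larger weights."""
    gaps = [(r - b) / temperature for r, b in zip(logp_rl, logp_base)]
    m = max(gaps)
    exps = [math.exp(g - m) for g in gaps]  # numerically stable softmax
    z = sum(exps)
    n = len(gaps)
    # Rescale so weights average to 1, keeping loss magnitude comparable
    # to unweighted training.
    return [n * e / z for e in exps]

def weighted_nll(logp_base, weights):
    """Token-reweighted negative log-likelihood for the base model."""
    return -sum(w * lp for w, lp in zip(weights, logp_base)) / len(weights)

# Toy example: three tokens; the first has the largest RL-vs-base gap,
# so it dominates the mid-training loss.
logp_rl = [-0.1, -2.0, -0.5]
logp_base = [-0.3, -1.0, -0.6]
w = token_weights(logp_rl, logp_base)
loss = weighted_nll(logp_base, w)
```

In practice the same idea would be applied per batch inside a standard cross-entropy training step, with the weights detached from the gradient so only the base model is updated.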