

LoopRPT: Reinforcement Pre-Training for Looped Language Models

March 20, 2026
Authors: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
cs.AI

Abstract

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
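The abstract's core mechanism (an EMA teacher reference plus noisy latent rollouts that let per-step rewards shape the loop's intermediate states) can be sketched in miniature. Everything below is a hypothetical illustration: the recurrence, reward, decay rate, and noise scale are stand-ins, not the actual Ouro architecture or the LoopRPT objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_update(teacher, student, decay=0.99):
    """EMA teacher: a slow-moving average of the student's parameters."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

def loop_step(W, h):
    """One latent iteration refining the hidden state h.
    A tanh recurrence stands in for the looped transformer block."""
    return np.tanh(W @ h)

def noisy_rollout(W, h0, n_steps, noise=0.1):
    """Unroll the loop with Gaussian noise injected at each latent step,
    keeping every intermediate state so a reward can be assigned per step."""
    states, h = [], h0
    for _ in range(n_steps):
        h = loop_step(W, h) + noise * rng.normal(size=h.shape)
        states.append(h)
    return states

def per_step_rewards(student_states, teacher_state):
    """Score each latent step by agreement with the EMA teacher's final
    representation (negative squared distance; an illustrative choice)."""
    return [-float(np.mean((h - teacher_state) ** 2)) for h in student_states]

# Toy usage: teacher unrolls many steps; student gets a noisy short rollout,
# and every latent step receives its own reinforcement signal.
d = 4
student = {"W": rng.normal(size=(d, d)) * 0.5}
teacher = {"W": student["W"].copy()}
h0 = rng.normal(size=d)

teacher = ema_update(teacher, student)
teacher_final = h0
for _ in range(8):
    teacher_final = loop_step(teacher["W"], teacher_final)

states = noisy_rollout(student["W"], h0, n_steps=4)
rewards = per_step_rewards(states, teacher_final)
print(len(rewards))  # one reward per latent step
```

Because the reward is attached to each latent step rather than only to output tokens, a policy-gradient update on these per-step signals would pressure early iterations toward the teacher's refined representation, which is the compression effect the abstract describes.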