

LoopRPT: Reinforcement Pre-Training for Looped Language Models

March 20, 2026
作者: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
cs.AI

Abstract

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
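The mechanism sketched in the abstract — rewarding each latent iteration by comparing a noisy student rollout against an EMA teacher reference — can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the loop block is a scalar update, `score` stands in for next-token log-probability, and all names (`loop_step`, `latent_rewards`, `ema_update`) are illustrative assumptions.

```python
import random

def loop_step(h, w):
    # One latent iteration: refine latent state h using parameter w
    # (stands in for a full looped transformer block).
    return h + 0.5 * (w - h)

def score(h, target):
    # Proxy for the log-probability of the correct next token given latent h.
    return -(h - target) ** 2

def latent_rewards(w_student, w_teacher, target, T=4, noise=0.1, seed=0):
    """Assign a reward to each latent step: how much a noisy student
    rollout improves the score relative to the EMA teacher's rollout."""
    rng = random.Random(seed)
    h_s = h_t = 0.0
    rewards = []
    for _ in range(T):
        h_s = loop_step(h_s, w_student) + rng.gauss(0.0, noise)  # noisy latent rollout
        h_t = loop_step(h_t, w_teacher)                           # EMA teacher reference
        rewards.append(score(h_s, target) - score(h_t, target))  # per-step credit
    return rewards

def ema_update(w_teacher, w_student, decay=0.99):
    # Exponential-moving-average teacher slowly tracks the student.
    return decay * w_teacher + (1.0 - decay) * w_student

rewards = latent_rewards(w_student=1.2, w_teacher=1.0, target=1.0)
w_teacher = ema_update(1.0, 1.2)  # -> 1.002
```

The per-step rewards would then drive a policy-gradient update of the student's loop parameters, which is how reinforcement signal reaches intermediate representations rather than only output tokens.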