LoopRPT: Reinforcement Pre-Training for Looped Language Models

Abstract

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures, whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
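The abstract mentions an EMA teacher reference without giving its update rule. As a minimal sketch, and assuming only the standard exponential-moving-average form (the function name `ema_update`, the `decay` parameter, and the flat list of floats standing in for model parameters are all illustrative assumptions, not details taken from LoopRPT), a teacher could track a student as follows:

```python
def ema_update(teacher, student, decay=0.99):
    """Standard EMA step: decay * teacher + (1 - decay) * student.

    Hypothetical helper for illustration; parameters are plain floats here,
    not the tensors a real model would use.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same length")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy usage: the teacher drifts toward a fixed student over three steps.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
# After 3 steps with decay 0.5 the teacher has covered 1 - 0.5**3 = 0.875
# of the distance to the student.
```

A slowly moving teacher of this kind is commonly used as a stable reference against which a faster-changing student is compared; how LoopRPT turns that comparison into per-step reinforcement signals is not specified in the abstract and is not modeled here.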
