Elucidating Hidden-State Recurrence: Switchable Latent Reasoning via On-Policy Reinforcement Learning

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including analysis of how on-policy RL itself improves the model from the inside.
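To make the claim about the policy ratio concrete, the following is a minimal sketch of a generic GRPO-style clipped surrogate evaluated only at discrete positions. It is not the Switch-GRPO objective itself, which additionally propagates gradients through the latent recurrence. The function names, the `is_discrete` mask, the clipping range `eps`, and the toy numbers are all illustrative assumptions. The point is only that <swi> and </swi>, being ordinary tokens, carry log-probabilities under both the old and the new policy, whereas continuous latent steps do not and are skipped.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def clipped_surrogate(new_logp, old_logp, is_discrete, advantage, eps=0.2):
    """Mean clipped policy-ratio term over discrete positions only.

    Latent positions (is_discrete is False) have no token log-probability,
    so they are skipped. Boundary tokens are ordinary tokens and are counted.
    """
    terms = []
    for lp_new, lp_old, discrete in zip(new_logp, old_logp, is_discrete):
        if not discrete:
            continue  # continuous hidden-state step: no ratio is defined here
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# Hypothetical rollout: "Let", <swi>, two latent steps, </swi>, "42".
# None marks latent steps, which emit no token and hence no log-probability.
old_logp = [-0.5, -1.2, None, None, -0.3, -0.8]
new_logp = [-0.4, -1.0, None, None, -0.3, -0.9]
is_discrete = [True, True, False, False, True, True]

# Toy rewards for a group of four rollouts; this rollout is the first one.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_surrogate(new_logp, old_logp, is_discrete, advantages[0])
```

In this sketch the decision to enter latent mode (the <swi> position) and the decision to leave it (the </swi> position) each contribute one ratio term, which is the sense in which every decision point is covered.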