Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
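To make the claim about the policy ratio concrete, the following is a minimal sketch, not the authors' Switch-GRPO implementation. It assumes that positions inside the latent block carry no sampled token (marked here with a hypothetical LATENT placeholder), so the ratio is evaluated only at discrete positions, including <swi> and </swi>. The function names, the clipping value eps=0.2, and the toy log-probabilities are all illustrative assumptions. The sketch also omits the KL term often used with GRPO and the gradient flow through the recurrent latent states that the abstract describes.

```python
import math
from statistics import mean, pstdev

# Hypothetical marker for a latent step: no token was sampled at that position.
LATENT = None


def group_advantages(rewards):
    """Group-relative advantages: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Mean clipped surrogate over the discrete decision points of one rollout.

    logp_new and logp_old hold one log-probability per position, or LATENT
    for positions inside the latent block (assumption: those are skipped).
    """
    terms = []
    for new, old in zip(logp_new, logp_old):
        if new is LATENT or old is LATENT:
            continue  # latent step: no sampled token, hence no ratio here
        ratio = math.exp(new - old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0


# Toy rollout layout: text token, <swi>, two latent steps, </swi>, text token.
rewards = [1.0, 0.0, 1.0, 0.0]          # one reward per rollout in the group
advantages = group_advantages(rewards)
logp_old = [-1.0, -0.5, LATENT, LATENT, -0.3, -2.0]
logp_new = [-0.9, -0.4, LATENT, LATENT, -0.3, -1.8]
print(clipped_surrogate(logp_new, logp_old, advantages[0]))
```

Under these assumptions the boundary tokens are treated exactly like any other sampled token, which is what keeps the ratio well-defined; how the latent steps contribute to the gradient is left to the paper's full objective.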