Unraveling Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both problems at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit it. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through the recurrent latent computation. SWITCH consistently outperforms prior latent reasoning approaches based on hidden-state recurrence at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that latent reasoning via hidden-state recurrence is both RL-trainable and open to direct mechanistic analysis, including analysis of how on-policy RL itself improves the model from the inside.
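
To make the remark about the GRPO policy ratio concrete, the following is a minimal sketch of a group-relative, clipped surrogate that is evaluated only at discrete decision points. It is not the Switch-GRPO objective itself, whose details are not given here. The function names, the `is_discrete` mask, the clip range of 0.2, and the use of the population standard deviation are all assumptions made for illustration. The sketch also assumes that latent steps are deterministic and therefore contribute no sampling probability, while <swi> and </swi> are sampled like any other token.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, is_discrete, advantage, clip=0.2):
    """Mean clipped surrogate over the discrete positions of one rollout.

    is_discrete[t] is True for sampled tokens (including the boundary tokens)
    and False for latent steps, which are skipped (hypothetical convention).
    """
    terms = []
    for lp_new, lp_old, discrete in zip(logp_new, logp_old, is_discrete):
        if not discrete:
            continue  # no sampling happened here, so there is no ratio
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Pessimistic choice between the raw and the clipped term
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0
```

As a usage illustration, a rollout of the form `x <swi> [latent] [latent] </swi> y` would carry the mask `[True, True, False, False, True, True]`, so four ratio terms enter the average and the two latent steps are left out. In this sketch the latent steps would influence the objective only through the log-probabilities of the tokens that follow them.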