Demystifying Hidden-State Recurrence: Switchable Latent Reasoning via On-Policy Reinforcement Learning

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through the recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further yields three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is not only RL-trainable but also open to direct mechanistic analysis, including analysis of how on-policy RL itself improves the model from the inside.
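
The claim that the policy ratio is well-defined at every decision point can be illustrated with a minimal NumPy sketch of a GRPO-style clipped surrogate. This is not the Switch-GRPO objective itself, whose details the abstract does not give. It assumes that discrete tokens (including <swi> and </swi>) are the only sampled positions and that latent steps are masked out of the ratio; the function names, the `token_mask` argument, and the clipping value are hypothetical. The sketch also omits the gradient flow through latent computation that the abstract describes.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of samples."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def clipped_surrogate(logp_new, logp_old, advantages, token_mask, clip=0.2):
    """Mean clipped surrogate over discrete-token positions only.

    logp_new, logp_old: (G, T) log-probabilities of the sampled tokens under
        the current and the behaviour policy.
    advantages: (G,) one advantage per sampled completion.
    token_mask: (G, T) 1.0 where a discrete token was sampled (boundary
        tokens included), 0.0 at latent steps. Treating latent steps as
        unsampled is an assumption of this sketch.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    token_mask = np.asarray(token_mask, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)[:, None]

    ratio = np.exp(logp_new - logp_old)  # per-position policy ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    per_token = np.minimum(unclipped, clipped) * token_mask
    return per_token.sum() / token_mask.sum()
```

With identical old and new log-probabilities every ratio equals 1, so the surrogate reduces to the mask-weighted mean advantage; changing the log-probability stored at a masked latent position leaves the value unchanged.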