Latent Reasoning with Normalizing Flows

Abstract

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
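To make the "exact likelihoods" and "left-to-right decoding" claims concrete, the following is a minimal sketch of a causal affine normalizing flow over a short sequence of continuous thoughts. It is a toy illustration under stated assumptions, not the NF-CoT implementation: each thought is a scalar rather than a vector, and the hypothetical `conditioner` (with made-up parameters `a` and `b`) stands in for the LLM backbone that would compute the shift and log-scale from the prefix. Because each position depends only on earlier positions, the Jacobian is triangular, and the log-likelihood is the base Gaussian log-density minus the sum of the log-scales.

```python
import math
import random


def conditioner(prefix, a=0.5, b=0.3):
    """Hypothetical causal conditioner (assumption, not from the paper).

    Maps the already-generated prefix to a (shift, log_scale) pair.
    A real model would compute these from backbone hidden states.
    """
    ctx = sum(prefix) / len(prefix) if prefix else 0.0
    return a * ctx, math.tanh(b * ctx)  # tanh keeps the log-scale bounded


def sample(T, rng):
    """Left-to-right sampling: x_t = shift_t + exp(log_scale_t) * z_t."""
    xs, zs = [], []
    for _ in range(T):
        mu, log_s = conditioner(xs)      # depends only on earlier positions
        z = rng.gauss(0.0, 1.0)          # base noise z_t ~ N(0, 1)
        xs.append(mu + math.exp(log_s) * z)
        zs.append(z)
    return xs, zs


def log_likelihood(xs):
    """Exact log p(x) via the change-of-variables formula."""
    total = 0.0
    for t, x in enumerate(xs):
        mu, log_s = conditioner(xs[:t])
        z = (x - mu) * math.exp(-log_s)  # invert the affine map
        log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
        total += log_base - log_s        # log|dz/dx| = -log_scale
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    thoughts, _ = sample(5, rng)
    print(thoughts)
    print(log_likelihood(thoughts))
```

Since `log_likelihood` is exact and built from differentiable operations, a quantity of this kind could serve as the log-probability term in a policy-gradient objective over sampled latent thoughts, which is the role the abstract describes for the NF head.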