Hybrid Latent Reasoning via Reinforcement Learning

Abstract

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work on latent reasoning.
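To make the gating idea in points (1) and (2) concrete, the following is a minimal sketch, not the HRPO implementation. It assumes a simple per-dimension convex gate, g * embedding + (1 - g) * hidden, and assumes the previous hidden state already lives in the embedding space; the abstract specifies neither. The names hybrid_input, sample_token, and gate_logits are hypothetical, and all values are toy numbers.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_input(token_embedding, prev_hidden, gate_logits):
    """Blend a sampled token's embedding with the previous hidden state.

    Assumed form (not taken from the paper): gate = sigmoid(gate_logit) per
    dimension. A gate near 1 keeps the input close to the discrete token
    embedding; a gate near 0 favors the continuous hidden state.
    """
    if not (len(token_embedding) == len(prev_hidden) == len(gate_logits)):
        raise ValueError("all vectors must have the same length")
    mixed = []
    for e, h, logit in zip(token_embedding, prev_hidden, gate_logits):
        g = sigmoid(logit)
        mixed.append(g * e + (1.0 - g) * h)
    return mixed


def sample_token(probs, rng):
    """Sample a token id from a categorical distribution (the stochastic step)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy values, purely illustrative.
rng = random.Random(0)
embedding_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prev_hidden = [0.2, -0.4, 0.6]
next_token_probs = [0.1, 0.7, 0.2]

token_id = sample_token(next_token_probs, rng)

# Large initial logits: sigmoid(6) is about 0.9975, so the input starts out
# as almost pure token embedding.
early_gate = [6.0, 6.0, 6.0]
# Hypothetical later state: sigmoid(0) = 0.5, an equal mix of both sources.
later_gate = [0.0, 0.0, 0.0]

early_input = hybrid_input(embedding_table[token_id], prev_hidden, early_gate)
later_input = hybrid_input(embedding_table[token_id], prev_hidden, later_gate)
```

In this sketch the randomness comes only from sample_token, which mirrors the abstract's point that token sampling supplies the stochasticity RL needs. Starting the gate near 1 keeps the model's inputs close to ordinary token embeddings, which is one plausible way to preserve generative ability early in training.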
