Hybrid Latent Reasoning via Reinforcement Learning

Abstract

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features instead of sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
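To make points (1) and (2) of the abstract more concrete, the following is a minimal sketch of one way a learnable gate could blend a sampled token's embedding with the previous hidden state. It is an illustration under stated assumptions, not the HRPO formulation: the class name `HybridGate`, the `init_bias` parameter, the sigmoid gate, and the convex blend are all hypothetical choices, and the hidden state is assumed to already live in the embedding space.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class HybridGate:
    """Hypothetical per-dimension gate blending a token embedding with a hidden state."""

    def __init__(self, dim: int, init_bias: float = 4.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Small weights plus a large positive bias keep the gate near 1 at
        # initialization, so the blended input starts out dominated by the
        # sampled token's embedding (an assumed way to realize point (2)).
        self.w_token = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.w_hidden = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.bias = [init_bias] * dim

    def gate(self, token_emb: list[float], hidden: list[float]) -> list[float]:
        """Return per-dimension gate values in (0, 1)."""
        return [
            sigmoid(wt * e + wh * h + b)
            for wt, wh, b, e, h in zip(
                self.w_token, self.w_hidden, self.bias, token_emb, hidden
            )
        ]

    def __call__(self, token_emb: list[float], hidden: list[float]) -> list[float]:
        """Convex blend: gate * token embedding + (1 - gate) * hidden state."""
        g = self.gate(token_emb, hidden)
        return [gi * e + (1.0 - gi) * h for gi, e, h in zip(g, token_emb, hidden)]


gate = HybridGate(dim=4)
token_emb = [1.0, 0.0, -1.0, 0.5]  # embedding of the sampled token
hidden = [0.2, 0.9, 0.3, -0.4]     # previous-step hidden state (assumed in embedding space)

# At initialization the gate is near 1, so the input stays close to the token embedding.
mixed_early = gate(token_emb, hidden)

# Stand-in for a gate that training has shifted toward hidden features.
gate.bias = [-4.0] * 4
mixed_late = gate(token_emb, hidden)
```

In this sketch the next-step input is still anchored to a sampled token, so rollouts remain stochastic, which is the property the abstract credits with enabling RL-based optimization without CoT trajectories.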

