Hybrid Latent Reasoning via Reinforcement Learning
May 24, 2025
作者: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
cs.AI
Abstract
Recent advances in large language models (LLMs) have introduced latent
reasoning as a promising alternative to autoregressive reasoning. By performing
internal computation with hidden states from previous steps, latent reasoning
benefits from more informative features rather than sampling a discrete
chain-of-thought (CoT) path. Yet latent reasoning approaches are often
incompatible with LLMs, as their continuous paradigm conflicts with the
discrete nature of autoregressive generation. Moreover, these methods rely on
CoT traces for training and thus fail to exploit the inherent reasoning
patterns of LLMs. In this work, we explore latent reasoning by leveraging the
intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we
introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid
latent reasoning approach that (1) integrates prior hidden states into sampled
tokens with a learnable gating mechanism, and (2) initializes training with
predominantly token embeddings while progressively incorporating more hidden
features. This design maintains LLMs' generative capabilities and incentivizes
hybrid reasoning using both discrete and continuous representations. In
addition, the hybrid design of HRPO introduces stochasticity into latent reasoning via
token sampling, thereby enabling RL-based optimization without requiring CoT
trajectories. Extensive evaluations across diverse benchmarks show that HRPO
outperforms prior methods in both knowledge- and reasoning-intensive tasks.
Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing
behaviors like cross-lingual patterns and shorter completion lengths,
highlighting the potential of our RL-based approach and offering insights for
future work in latent reasoning.
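
The gating mechanism described in the abstract can be pictured with a minimal sketch. The snippet below is an illustrative assumption rather than the paper's exact formulation: a hypothetical HybridGate module computes a sigmoid gate over the concatenation of the sampled token's embedding and the previous step's hidden state, then mixes the two. Initializing the gate bias strongly negative makes early training rely almost entirely on token embeddings, with hidden features introduced progressively as the gate opens.

    import torch
    import torch.nn as nn

    class HybridGate(nn.Module):
        """Illustrative gate mixing a sampled token's embedding with the
        hidden state from the previous decoding step (a sketch under stated
        assumptions, not the official HRPO implementation)."""

        def __init__(self, hidden_size: int, init_bias: float = -4.0):
            super().__init__()
            self.gate = nn.Linear(2 * hidden_size, hidden_size)
            # A strongly negative bias keeps sigmoid(gate) near 0 at the start
            # of training, so the mixed input is dominated by token embeddings;
            # hidden features enter gradually as the gate opens.
            nn.init.constant_(self.gate.bias, init_bias)

        def forward(self, token_emb: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
            # token_emb, prev_hidden: (batch, hidden_size)
            g = torch.sigmoid(self.gate(torch.cat([token_emb, prev_hidden], dim=-1)))
            # Hybrid input combining the discrete token and continuous hidden features
            return (1.0 - g) * token_emb + g * prev_hidden

Because the next-step input still depends on a sampled token, rollouts remain stochastic, which is what allows RL-style policy optimization without CoT supervision as the abstract describes.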