

Hybrid Latent Reasoning via Reinforcement Learning

May 24, 2025
Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
cs.AI

Abstract

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
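
To make the gating idea in point (1) concrete, below is a minimal sketch (not the paper's implementation) of how a learnable gate could blend the sampled token's embedding with the previous step's hidden state. The module name `HybridGate`, the per-dimension sigmoid gate, and the negative initialization bias are illustrative assumptions; the bias is chosen so that training starts with predominantly token embeddings and hidden features are blended in gradually, matching point (2) of the abstract.

```python
# Minimal sketch of a hybrid (discrete + continuous) input gate, assuming a
# PyTorch decoder. All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn


class HybridGate(nn.Module):
    """Blend the sampled token's embedding with the previous hidden state."""

    def __init__(self, hidden_size: int, init_bias: float = -4.0):
        super().__init__()
        # A strongly negative initial bias keeps sigmoid(gate) close to 0,
        # so the mixed input starts out as almost pure token embedding.
        self.gate = nn.Parameter(torch.full((hidden_size,), init_bias))

    def forward(self, token_embedding: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.gate)  # per-dimension mixing weight in (0, 1)
        return (1 - alpha) * token_embedding + alpha * prev_hidden


# Usage: combine the embedding of the sampled token with the last hidden state
# before feeding the result back as the next decoder input.
hidden_size = 768
gate = HybridGate(hidden_size)
token_embedding = torch.randn(1, hidden_size)  # embedding of the sampled token
prev_hidden = torch.randn(1, hidden_size)      # hidden state from the previous step
next_input = gate(token_embedding, prev_hidden)
```

Because the token is still sampled at each step, this hybrid input retains the stochasticity needed for RL-style policy optimization while the learned gate decides how much continuous hidden-state information to carry forward.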
