Hybrid Latent Reasoning via Reinforcement Learning

May 24, 2025
作者: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
cs.AI

Abstract

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning over both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods on both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
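The abstract only sketches HRPO's two mechanisms at a high level: a learnable gate that folds the previous step's hidden state into the embedding of each sampled token, and a schedule that begins with mostly token embeddings and gradually admits more hidden features. The PyTorch snippet below is a minimal, illustrative sketch of such a gated interpolation; the names (`HybridGate`, `gate_proj`, `alpha`) and shapes are assumptions made for illustration and are not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class HybridGate(nn.Module):
    """Illustrative gate that mixes a sampled token's embedding with the
    previous step's hidden state (an assumption, not the paper's code)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Learnable gate conditioned on both the token embedding and the hidden state.
        self.gate_proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_emb: torch.Tensor, prev_hidden: torch.Tensor,
                alpha: float) -> torch.Tensor:
        # alpha in [0, 1] is a training schedule: near 0 early on (mostly token
        # embeddings), growing over training to admit more hidden features.
        gate = torch.sigmoid(self.gate_proj(torch.cat([token_emb, prev_hidden], dim=-1)))
        return (1.0 - alpha * gate) * token_emb + (alpha * gate) * prev_hidden


# Toy usage: batch of 2 sequences, hidden size 8.
gate = HybridGate(hidden_size=8)
tok = torch.randn(2, 8)   # embeddings of sampled tokens
hid = torch.randn(2, 8)   # hidden states from the previous step
out_early = gate(tok, hid, alpha=0.1)  # early training: mostly discrete token embeddings
out_late = gate(tok, hid, alpha=0.9)   # later training: more continuous hidden features
```

Because the next-step input remains anchored to a sampled token, the rollout keeps the stochasticity needed for RL-style policy optimization while still carrying continuous information forward, which is the hybrid behavior the abstract describes.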
