Beyond Imitation: Reinforcement Learning for Active Latent Planning
January 29, 2026
Authors: Zhi Zheng, Wee Sun Lee
cs.AI
Abstract
Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune large language models (LLMs) to substitute continuous latent tokens for discrete language tokens. These methods consume fewer tokens than conventional language-based CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised by imitating language labels. Since a question can have multiple equivalent but diverse CoT labels, passively imitating an arbitrary one may yield inferior latent token representations and latent reasoning policies, undermining the potential planning ability and creating a clear gap between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens for achieving the optimal latent reasoning policy. To this end, we propose the Active Latent Planning method (ATP-Latent), which models the supervision of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to steer the model toward the most reasonable latent reasoning policy, ATP-Latent performs reinforcement learning (RL) with an auxiliary coherence reward, computed from the consistency between the VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent achieves +4.1% accuracy and -3.3% token consumption on four benchmarks compared to advanced baselines. Code is available at https://github.com/zz1358m/ATP-Latent-master.
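The abstract describes two ingredients: a conditional-VAE objective for supervising latent tokens, and a coherence reward based on consistency between VAE-decoded contents. The paper's actual objective is not given here, so the following is only a minimal numerical sketch of those two ideas. The function names (`cvae_latent_loss`, `coherence_reward`), the MSE reconstruction surrogate, the diagonal-Gaussian KL term, and the cosine-similarity consistency measure are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def cvae_latent_loss(decoded, target, mu, logvar, beta=1.0):
    """VAE-style supervision of a latent token: reconstruction of the CoT
    label embedding (MSE surrogate) plus a beta-weighted KL regularizer
    that encourages a smoother latent space."""
    recon = np.mean((decoded - target) ** 2)
    return recon + beta * kl_diag_gaussian(mu, logvar)

def coherence_reward(decoded_steps):
    """Auxiliary RL reward: mean cosine similarity between the decoded
    contents of consecutive latent tokens, so that adjacent reasoning
    steps are rewarded for staying consistent with each other."""
    sims = []
    for a, b in zip(decoded_steps[:-1], decoded_steps[1:]):
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.mean(sims))

# Toy usage with random embeddings standing in for decoded latent tokens.
rng = np.random.default_rng(0)
mu, logvar = rng.normal(size=8), rng.normal(size=8)
decoded, target = rng.normal(size=16), rng.normal(size=16)
loss = cvae_latent_loss(decoded, target, mu, logvar)
reward = coherence_reward([rng.normal(size=16) for _ in range(4)])
```

In an actual training loop, `loss` would supervise the latent-token encoder/decoder, while `reward` (or some transformation of it) would be added to the task reward during RL fine-tuning.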