
Beyond Imitation: Reinforcement Learning for Active Latent Planning

January 29, 2026
Authors: Zhi Zheng, Wee Sun Lee
cs.AI

Abstract

To achieve efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to replace discrete language tokens with continuous latent tokens. These methods consume fewer tokens than conventional language-based CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised by imitating language labels. Since a question can have multiple equivalent but diverse CoT labels, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining planning ability and creating a clear gap between training and testing. In this work, we emphasize the importance of actively planning over the representation space of latent tokens to achieve an optimal latent reasoning policy. To this end, we propose the Active Latent Planning method (ATP-Latent), which models the supervision of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to steer the model toward the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, computed from the consistency between the VAE-decoded contents of latent tokens, yielding a guided RL process. In experiments on LLaMA-1B, ATP-Latent achieves +4.1% accuracy with 3.3% fewer tokens across four benchmarks compared to advanced baselines. Code is available at https://github.com/zz1358m/ATP-Latent-master.
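The abstract names two mechanisms: a conditional VAE over latent tokens (for a smoother latent space) and an auxiliary coherence reward over VAE-decoded contents (to guide RL). The minimal PyTorch sketch below illustrates the general shape of these two pieces; it is not the paper's implementation. All names (`LatentCVAE`, `coherence_reward`), the one-latent-per-step layout, the linear decoder head, the β weight, and the use of cosine similarity as the "consistency" measure are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCVAE(nn.Module):
    """Hypothetical conditional-VAE head over latent reasoning tokens."""

    def __init__(self, hidden_dim: int, latent_dim: int, vocab_size: int):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder mapping a latent token back to language-token logits.
        self.decoder = nn.Linear(latent_dim, vocab_size)

    def forward(self, h: torch.Tensor, cot_label_ids: torch.Tensor, beta: float = 0.1):
        # h: (steps, hidden_dim) hidden states at latent-token positions;
        # cot_label_ids: (steps,) language CoT label tokens to reconstruct.
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample latents rather than copying one label.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        logits = self.decoder(z)
        # Reconstruction term: decoded latents should recover the CoT label.
        recon = F.cross_entropy(logits, cot_label_ids)
        # KL term toward a standard normal prior smooths the latent space.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + beta * kl, z

def coherence_reward(decoded_embeds: torch.Tensor) -> torch.Tensor:
    """Auxiliary RL reward: mean cosine similarity between the VAE-decoded
    contents of consecutive latent reasoning steps (an assumed measure)."""
    sims = F.cosine_similarity(decoded_embeds[:-1], decoded_embeds[1:], dim=-1)
    return sims.mean()
```

In this reading, the CVAE loss supervises latent tokens against language labels without collapsing onto any single arbitrary label, and the coherence reward would be added to the task reward during RL to favor latent trajectories whose decoded steps agree with one another.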