
Coupled Variational Reinforcement Learning for Language Model General Reasoning

December 14, 2025
Authors: Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
cs.AI

Abstract

While reinforcement learning has achieved impressive progress in language model reasoning, it remains constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the intrinsic probability that the LLM generates the reference answer as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling the prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
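The coupling described in the abstract can be read through a standard variational lens. The sketch below is illustrative only and is not the paper's exact formulation: the notation x (question), y (reference answer), z (reasoning trace), the mixing weight \lambda, and the particular mixture form are assumptions made here for exposition.

```latex
% Illustrative variational view of verifier-free RL with hybrid sampling.
% x: question, y: reference answer, z: sampled reasoning trace, \theta: LM parameters.
% This is a sketch under assumed notation, not the objective defined in the paper.
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Evidence lower bound: the log-probability of the reference answer is bounded by
% an expected answer-likelihood "reward" term minus a KL regularizer that keeps the
% trace posterior close to the question-only prior.
\begin{align}
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\theta(z \mid x, y)}
       \bigl[\log p_\theta(y \mid x, z)\bigr]
   \;-\; \mathrm{KL}\bigl(q_\theta(z \mid x, y)\,\|\,p_\theta(z \mid x)\bigr).
\end{align}

% Hybrid sampling: draw traces from a composite distribution mixing the prior
% (question-only) and posterior (question + answer) samplers; \lambda is an assumed
% mixing weight, not a quantity specified in the abstract.
\begin{align}
\pi_{\mathrm{mix}}(z \mid x, y)
  \;=\; (1-\lambda)\, p_\theta(z \mid x) \;+\; \lambda\, q_\theta(z \mid x, y),
  \qquad \lambda \in [0, 1].
\end{align}

\end{document}
```

Under this reading, sampling some traces conditioned on the reference answer steers exploration toward trajectories consistent with it, while the prior-conditioned samples and the KL term keep the learned policy usable at test time, when no answer is available.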