
Coupled Variational Reinforcement Learning for Language Model General Reasoning

December 14, 2025
Authors: Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
cs.AI

Abstract

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the intrinsic probability that the LLM assigns to generating the reference answer as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling the prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
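
The abstract only outlines the hybrid prior/posterior sampling idea, so the sketch below is an illustrative reconstruction rather than the paper's implementation. Everything named here is an assumption introduced for illustration: the DummyLM wrapper, its sample and answer_logprob methods, the posterior_ratio parameter, and the answer-hint prompt. The sketch shows the mechanism the abstract describes: some reasoning traces are drawn conditioned on the question alone (the prior), others with the reference answer visible (the posterior), and every trace is scored by the model's log-probability of the reference answer, i.e. a verifier-free reward.

```python
"""Illustrative sketch (not the authors' code) of hybrid prior/posterior
trace sampling with a verifier-free reward. The LM interface below is
hypothetical; any wrapper exposing trace sampling and answer
log-likelihoods could play this role."""
import math
import random
from dataclasses import dataclass


@dataclass
class Trace:
    text: str              # sampled reasoning trace
    from_posterior: bool    # True if the reference answer conditioned the sample
    reward: float = 0.0     # verifier-free reward: log p(answer | question, trace)


class DummyLM:
    """Stand-in model; replace with a real LM wrapper in practice."""

    def sample(self, prompt: str) -> str:
        # A real implementation would decode a chain-of-thought here.
        return f"step-by-step reasoning for: {prompt[:30]}..."

    def answer_logprob(self, prompt: str, answer: str) -> float:
        # A real implementation would score the reference-answer tokens.
        return -random.uniform(0.5, 5.0)


def hybrid_sample(lm, question: str, answer: str, k: int = 8,
                  posterior_ratio: float = 0.5) -> list[Trace]:
    """Draw k traces, mixing prior samples (question only) with posterior
    samples (question plus the reference answer as extra conditioning)."""
    traces = []
    n_post = round(k * posterior_ratio)
    for i in range(k):
        use_posterior = i < n_post
        prompt = (f"{question}\n(The final answer is {answer}. "
                  f"Explain how to reach it.)") if use_posterior else question
        text = lm.sample(prompt)
        # Reward: how likely the model finds the reference answer after
        # producing this trace, scored without the answer hint.
        reward = lm.answer_logprob(f"{question}\n{text}", answer)
        traces.append(Trace(text, use_posterior, reward))
    return traces


def group_advantages(traces: list[Trace]) -> list[float]:
    """REINFORCE-style, group-normalized advantages (one possible choice)."""
    rewards = [t.reward for t in traces]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    lm = DummyLM()
    traces = hybrid_sample(lm, "What is 17 * 24?", "408", k=4)
    for t, adv in zip(traces, group_advantages(traces)):
        print(f"posterior={t.from_posterior} reward={t.reward:.2f} adv={adv:+.2f}")
```

The group-normalized advantages above are only a placeholder for a policy-gradient update; the paper's actual objective, built from a composite distribution with a variational bound, is not reproduced here.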