Variational Reasoning for Language Models

September 26, 2025
Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
cs.AI

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
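The abstract compresses several derivations. For orientation, the following is a minimal sketch of the central objects under standard variational-inference conventions; the notation (x for the question, y for the gold answer, z for a thinking trace, p_theta for the language model, q_phi for the variational posterior, r for a binary correctness reward, K for the number of sampled traces) is assumed for illustration and is not taken from the paper.

```latex
% Hedged sketch, not the paper's exact formulation. Notation assumed:
% x question, y gold answer, z latent thinking trace, p_theta model,
% q_phi variational posterior, r(z) binary reward, K trace count.

% (1) Single-trace ELBO over latent thinking traces.
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
  \left[ \log \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}
                   {q_\phi(z \mid x, y)} \right]

% (2) Multi-trace importance-weighted bound, tighter as K grows.
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(\cdot \mid x, y)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
    \frac{p_\theta(z_k \mid x)\, p_\theta(y \mid x, z_k)}
         {q_\phi(z_k \mid x, y)} \right]

% (3) A standard conditioning identity showing how a binary reward
% induces an implicit accuracy weight: with r(z) = 1 iff z yields
% the correct answer,
\mathbb{E}_{z \sim p_\theta(z \mid x)}
  \bigl[ r(z)\, \log q_\phi(z \mid x, y) \bigr]
  \;=\; \underbrace{\Pr_{z \sim p_\theta(z \mid x)}[\, r(z) = 1 \,]}_{\text{accuracy on } x}
  \cdot\, \mathbb{E}_{z \sim p_\theta(z \mid x,\, r = 1)}
  \bigl[ \log q_\phi(z \mid x, y) \bigr]
```

Identity (3) is one way to read the abstract's claim of an "implicit weighting by model accuracy": per-question gradients of such an objective scale with how often the model already succeeds on x, which favors easier questions. The forward-KL variant mentioned in the abstract would correspondingly minimize KL(p_theta(z | x, y) || q_phi(z | x, y)) rather than the reverse direction implicit in the ELBO, a common route to more stable posterior training. Both remarks are standard constructions offered for orientation, not the paper's exact losses.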