Variational Reasoning for Language Models

September 26, 2025
Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
cs.AI

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
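The abstract compresses several derivations. For orientation, the following is a minimal sketch of the central objects under standard variational-inference conventions; the notation (x for the question, y for the gold answer, z for a thinking trace, p_theta for the language model, q_phi for the variational posterior, r for a binary correctness reward, K for the number of sampled traces) is assumed for illustration and is not taken from the paper.

```latex
% Hedged sketch, not the paper's exact formulation. Notation assumed:
% x question, y gold answer, z latent thinking trace, p_theta model,
% q_phi variational posterior, r(z) binary reward, K trace count.

% (1) Single-trace ELBO over latent thinking traces.
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
  \left[ \log \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}
                   {q_\phi(z \mid x, y)} \right]

% (2) Multi-trace importance-weighted bound, tighter as K grows.
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(\cdot \mid x, y)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
    \frac{p_\theta(z_k \mid x)\, p_\theta(y \mid x, z_k)}
         {q_\phi(z_k \mid x, y)} \right]

% (3) A standard conditioning identity showing how a binary reward
% induces an implicit accuracy weight: with r(z) = 1 iff z yields
% the correct answer,
\mathbb{E}_{z \sim p_\theta(z \mid x)}
  \bigl[ r(z)\, \log q_\phi(z \mid x, y) \bigr]
  \;=\; \underbrace{\Pr_{z \sim p_\theta(z \mid x)}[\, r(z) = 1 \,]}_{\text{accuracy on } x}
  \cdot\, \mathbb{E}_{z \sim p_\theta(z \mid x,\, r = 1)}
  \bigl[ \log q_\phi(z \mid x, y) \bigr]
```

Identity (3) is one way to read the abstract's claim of an "implicit weighting by model accuracy": per-question gradients of such an objective scale with how often the model already succeeds on x, which favors easier questions. The forward-KL variant mentioned in the abstract would correspondingly minimize KL(p_theta(z | x, y) || q_phi(z | x, y)) rather than the reverse direction implicit in the ELBO, a common route to more stable posterior training. Both remarks are standard constructions offered for orientation, not the paper's exact losses.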