

Variational Reasoning for Language Models

September 26, 2025
Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
cs.AI

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
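As a rough sketch of the probabilistic setup the abstract describes (the notation below is illustrative, not taken from the paper): writing $x$ for the question, $y$ for the final answer, $z$ for a thinking trace, $p_\theta$ for the language model, and $q_\phi$ for a variational posterior over traces, the single-trace ELBO and a multi-trace (importance-weighted) tightening would take a form such as:

\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log \frac{p_\theta(y, z \mid x)}{q_\phi(z \mid x, y)} \right]

\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(\cdot \mid x, y)}\!\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x, y)} \right]

The multi-trace bound is never looser than the single-trace ELBO and tightens as the number of sampled traces $K$ grows; the paper's exact objective, and its forward-KL formulation for stabilizing the posterior, may differ from this sketch.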