

Variational Reasoning for Language Models

September 26, 2025
Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
cs.AI

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
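As a rough sketch of the probabilistic setup the abstract describes (the notation below is illustrative, not taken from the paper): writing $x$ for the question, $y$ for the final answer, $z$ for a thinking trace, $p_\theta$ for the language model, and $q_\phi$ for a variational posterior over traces, the single-trace ELBO and a multi-trace (importance-weighted) tightening would take a form such as:

\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log \frac{p_\theta(y, z \mid x)}{q_\phi(z \mid x, y)} \right]

\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(\cdot \mid x, y)}\!\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x, y)} \right]

The multi-trace bound is never looser than the single-trace ELBO and tightens as the number of sampled traces $K$ grows; the paper's exact objective, and its forward-KL formulation for stabilizing the posterior, may differ from this sketch.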