Variational Reasoning for Language Models

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives. In this derivation, an implicit weighting by model accuracy arises naturally and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.

