Variational Reasoning for Language Models

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL (Kullback-Leibler) formulation that stabilizes the training of the variational posterior. We further show that rejection sampling fine-tuning and binary-reward reinforcement learning (RL), including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy arises naturally from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
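
The accuracy weighting mentioned above can be illustrated on a toy problem. The sketch below is not taken from the paper's code; it assumes a hypothetical categorical distribution over a handful of candidate traces, a binary reward marking which traces are correct, and at least one correct trace with nonzero probability. Under those assumptions it checks numerically that the gradient of the expected binary reward equals the model's accuracy multiplied by the gradient of a forward-KL objective toward the reward-filtered distribution. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_grad(logits, rewards):
    """Gradient w.r.t. logits of E_{z ~ pi}[R(z)], via the score function."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for z in range(n):
        for k in range(n):
            # d log pi(z) / d logit_k for a softmax policy
            score = (1.0 if k == z else 0.0) - pi[k]
            grad[k] += pi[z] * rewards[z] * score
    return grad


def forward_kl_grad(logits, rewards):
    """Gradient w.r.t. logits of -KL(q || pi), with q held fixed.

    q(z) is proportional to pi(z) * R(z): the model restricted to
    correct traces (a toy stand-in for a posterior over traces).
    Returns the gradient and the accuracy sum_z pi(z) R(z).
    """
    pi = softmax(logits)
    acc = sum(p * r for p, r in zip(pi, rewards))  # assumed > 0
    q = [p * r / acc for p, r in zip(pi, rewards)]
    grad = [qk - pk for qk, pk in zip(q, pi)]
    return grad, acc


# Hypothetical toy values: four candidate traces, two of them correct.
logits = [0.2, -1.0, 0.5, 1.3]
rewards = [1.0, 0.0, 0.0, 1.0]

rw_grad = reward_weighted_grad(logits, rewards)
fkl_grad, acc = forward_kl_grad(logits, rewards)

# The reward-weighted gradient is the forward-KL gradient scaled by accuracy.
matches = all(abs(a - acc * b) < 1e-12 for a, b in zip(rw_grad, fkl_grad))
```

In this toy setting, a question on which the model is rarely correct contributes a proportionally smaller gradient, which is one way to read the bias toward easier questions described in the abstract.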
