LLaDA 1.5:面向大型语言扩散模型的方差缩减偏好优化
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
May 25, 2025
Authors: Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI
Abstract
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising
paradigm for language modeling, there has been relatively little effort in
aligning these models with human preferences via reinforcement learning. The
challenge primarily arises from the high variance in Evidence Lower Bound
(ELBO)-based likelihood estimates required for preference optimization. To
address this issue, we propose Variance-Reduced Preference Optimization (VRPO),
a framework that formally analyzes the variance of ELBO estimators and derives
bounds on both the bias and variance of preference optimization gradients.
Building on this theoretical foundation, we introduce unbiased variance
reduction strategies, including optimal Monte Carlo budget allocation and
antithetic sampling, that significantly improve the performance of MDM
alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA,
and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor
consistently and significantly across mathematical (GSM8K +4.7), code
(HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard
+4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical
performance compared to strong language MDMs and ARMs. Project page:
https://ml-gsai.github.io/LLaDA-1.5-Demo/.
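One of the variance-reduction strategies the abstract names is antithetic sampling. As a generic illustration only (this is a textbook toy example, not the paper's ELBO estimator, and all function names here are illustrative), the sketch below compares a plain Monte Carlo estimator with an antithetic one on a simple monotone integrand, where pairing each uniform draw `u` with `1 - u` produces negatively correlated samples and hence a lower-variance average.

```python
import math
import random
import statistics

def plain_mc(f, n, rng):
    # Standard Monte Carlo: average f over n independent Uniform(0,1) draws.
    return statistics.fmean(f(rng.random()) for _ in range(n))

def antithetic_mc(f, n, rng):
    # Antithetic sampling: pair each uniform u with its mirror 1 - u.
    # For monotone f, f(u) and f(1-u) are negatively correlated, so the
    # paired average has lower variance at the same sample budget.
    pairs = n // 2
    total = 0.0
    for _ in range(pairs):
        u = rng.random()
        total += f(u) + f(1.0 - u)
    return total / (2 * pairs)

def estimator_variance(estimator, f, n, reps, seed):
    # Empirical variance of an estimator across independent replications.
    rng = random.Random(seed)
    return statistics.pvariance([estimator(f, n, rng) for _ in range(reps)])

# Toy integrand: E[exp(U)] with U ~ Uniform(0,1); true value is e - 1.
v_plain = estimator_variance(plain_mc, math.exp, 64, 500, seed=0)
v_anti = estimator_variance(antithetic_mc, math.exp, 64, 500, seed=0)
print(v_anti < v_plain)  # antithetic estimator shows lower empirical variance
```

In the paper's setting the same principle is applied to ELBO-based likelihood estimates (alongside an optimal Monte Carlo budget allocation), but the mechanics above carry over: correlate the samples so that their fluctuations partially cancel, without biasing the estimate.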