LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
May 25, 2025
Authors: Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
cs.AI
Abstract
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising
paradigm for language modeling, there has been relatively little effort in
aligning these models with human preferences via reinforcement learning. The
challenge primarily arises from the high variance in Evidence Lower Bound
(ELBO)-based likelihood estimates required for preference optimization. To
address this issue, we propose Variance-Reduced Preference Optimization (VRPO),
a framework that formally analyzes the variance of ELBO estimators and derives
bounds on both the bias and variance of preference optimization gradients.
Building on this theoretical foundation, we introduce unbiased variance
reduction strategies, including optimal Monte Carlo budget allocation and
antithetic sampling, that significantly improve the performance of MDM
alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA,
and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor
consistently and significantly across mathematical (GSM8K +4.7), code
(HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard
+4.3). Furthermore, LLaDA 1.5 demonstrates highly competitive mathematical
performance compared to strong language MDMs and autoregressive models (ARMs). Project page:
https://ml-gsai.github.io/LLaDA-1.5-Demo/
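
The abstract attributes the difficulty to high variance in Monte Carlo likelihood estimates used for preference optimization. As a rough illustration of why the sampling scheme matters, the toy sketch below compares two ways of estimating the difference between a "policy" and a "reference" quantity: with independent random draws, and with the same draws shared between the two (a classical technique often called common random numbers). Everything here is an assumption made for illustration: the function `toy_elbo_term`, its scale parameters, and the sample sizes are hypothetical, and the sketch is not the VRPO estimator or the LLaDA loss. Whether it matches the paper's antithetic sampling in detail is not stated in the abstract.

```python
import random
import statistics


def toy_elbo_term(t, scale):
    """Hypothetical stand-in for a per-sample likelihood term (not the LLaDA loss)."""
    return -scale * (1.0 + 4.0 * t)


def difference_estimate(n_samples, rng, shared_noise):
    """Monte Carlo estimate of (policy term - reference term)."""
    ts_policy = [rng.random() for _ in range(n_samples)]
    # Reuse the policy's draws for the reference, or draw a fresh independent set.
    if shared_noise:
        ts_ref = ts_policy
    else:
        ts_ref = [rng.random() for _ in range(n_samples)]
    policy = statistics.fmean(toy_elbo_term(t, 1.0) for t in ts_policy)
    reference = statistics.fmean(toy_elbo_term(t, 1.1) for t in ts_ref)
    return policy - reference


def summarize(shared_noise, n_samples=8, n_trials=2000, seed=0):
    """Return (mean, variance) of the estimate over repeated trials."""
    rng = random.Random(seed)
    estimates = [
        difference_estimate(n_samples, rng, shared_noise) for _ in range(n_trials)
    ]
    return statistics.fmean(estimates), statistics.variance(estimates)


mean_independent, var_independent = summarize(shared_noise=False)
mean_shared, var_shared = summarize(shared_noise=True)

print(f"independent draws: mean={mean_independent:.3f} var={var_independent:.5f}")
print(f"shared draws:      mean={mean_shared:.3f} var={var_shared:.5f}")
```

In this toy setting both schemes target the same expected difference, so sharing draws leaves the estimate unbiased, while the variance drops sharply because the two terms are strongly correlated. That is the general sense in which a variance reduction strategy can be unbiased.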