LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Abstract

While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort to align these models with human preferences via reinforcement learning. The challenge arises primarily from the high variance of the Evidence Lower Bound (ELBO)-based likelihood estimates that preference optimization requires. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and the variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA. The resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/
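The abstract does not spell out how its variance reduction strategies are implemented, so the following is only a toy sketch of the general idea, not the VRPO estimator. It assumes that a preference objective compares two Monte Carlo estimates (for example, one per model), and shows that reusing the same random draws for both estimates can shrink the variance of their difference without changing its expected value. The function names `toy_estimate` and `difference_variance`, the scale factors, and the Gaussian noise are all hypothetical stand-ins chosen for illustration.

```python
import random
import statistics


def toy_estimate(scale, draws):
    """Toy Monte Carlo estimate: the mean of a noisy per-sample term.

    Hypothetical stand-in for a sampled likelihood bound; not LLaDA's ELBO.
    """
    return sum(scale * d * d for d in draws) / len(draws)


def difference_variance(shared, trials=2000, n_samples=8, seed=0):
    """Empirical variance of (estimate_a - estimate_b) over many trials.

    If `shared` is True, both estimates reuse the same random draws;
    otherwise each estimate gets its own independent draws.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        draws_a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if shared:
            draws_b = draws_a
        else:
            draws_b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        diffs.append(toy_estimate(1.0, draws_a) - toy_estimate(0.9, draws_b))
    return statistics.variance(diffs)


var_independent = difference_variance(shared=False)
var_shared = difference_variance(shared=True)
print(f"independent draws: {var_independent:.4f}")
print(f"shared draws:      {var_shared:.4f}")
```

In this toy setting the two estimates are strongly positively correlated when they share draws, so most of the noise cancels in the difference. How much of this carries over to ELBO estimates for real MDMs depends on details the abstract does not give.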
