LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Abstract

While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, relatively little effort has been devoted to aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
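
The abstract attributes the difficulty to the variance of sampled likelihood estimates and names antithetic sampling as one remedy, without spelling out the procedure. The toy sketch below is not the paper's method. It only illustrates a general principle that may be related: when a preference margin is a difference of two noisy, unbiased estimates, reusing the same random draw for both terms can shrink the variance of the difference without changing its mean. The constants, the `noisy_estimate` function, and the uniform noise model are all hypothetical and chosen for illustration.

```python
import random
import statistics

TRUE_POLICY = -1.0     # hypothetical "true" log-likelihood under the policy
TRUE_REFERENCE = -1.5  # hypothetical "true" log-likelihood under the reference


def noisy_estimate(true_value, sensitivity, u):
    """Toy one-sample estimate: unbiased because E[u - 0.5] = 0 for u ~ U(0, 1)."""
    return true_value + sensitivity * (u - 0.5)


def margin_estimates(n, shared_noise, seed=0):
    """Draw n estimates of (policy - reference), with or without shared noise."""
    rng = random.Random(seed)
    margins = []
    for _ in range(n):
        u_policy = rng.random()
        # Reuse the policy's draw, or take a fresh independent one.
        u_reference = u_policy if shared_noise else rng.random()
        policy = noisy_estimate(TRUE_POLICY, 2.0, u_policy)
        reference = noisy_estimate(TRUE_REFERENCE, 1.5, u_reference)
        margins.append(policy - reference)
    return margins


independent = margin_estimates(20_000, shared_noise=False)
shared = margin_estimates(20_000, shared_noise=True)

# Both means should sit near the true margin of 0.5; only the spread differs.
print("independent:", statistics.mean(independent), statistics.variance(independent))
print("shared:     ", statistics.mean(shared), statistics.variance(shared))
```

In this toy model the variances can be worked out by hand, since a U(0, 1) draw has variance 1/12: independent draws give (2.0^2 + 1.5^2)/12, roughly 0.52, while a shared draw gives 0.5^2/12, roughly 0.021. The reduction depends entirely on the two estimates being positively correlated under the shared draw, which is an assumption of this sketch.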
