Improved Large Language Diffusion Models

Abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B-parameter masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and code: https://github.com/ML-GSAI/LLaDA.
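The abstract names the masked diffusion objective without spelling it out. As a rough illustration only, the sketch below assumes the commonly used formulation in which each token is masked independently with probability t and the loss is the cross-entropy on masked positions, weighted by 1/t and averaged over sequence length. The names `MASK_ID`, `mask_tokens`, and `masked_diffusion_loss` are hypothetical, and nothing here is taken from the iLLaDA code.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token (assumption)

def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK_ID with probability t."""
    is_masked = rng.random(tokens.shape) < t
    noisy = np.where(is_masked, MASK_ID, tokens)
    return noisy, is_masked

def masked_diffusion_loss(logits, tokens, is_masked, t):
    """Cross-entropy on masked positions only, scaled by 1 / (t * length).

    logits: (length, vocab) scores from a bidirectional model (assumed given).
    tokens: (length,) clean token ids.
    is_masked: (length,) boolean mask produced by mask_tokens.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the clean token at every position.
    token_nll = -np.take_along_axis(log_probs, tokens[:, None], axis=-1)[:, 0]
    # Only masked positions contribute to the loss.
    return float((token_nll * is_masked).sum() / (t * len(tokens)))

rng = np.random.default_rng(0)
tokens = np.array([3, 1, 4, 1, 5])
t = 0.5  # in training, t would be sampled per sequence (assumption)
noisy, is_masked = mask_tokens(tokens, t, rng)
logits = np.zeros((len(tokens), 8))  # stand-in for model output on `noisy`
loss = masked_diffusion_loss(logits, tokens, is_masked, t)
```

Because the model sees the whole noisy sequence at once, no causal mask is involved; the 1/t weight compensates for the fact that fewer tokens are masked at small t.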