Improved Large Language Diffusion Models

Abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout both pre-training and supervised fine-tuning (SFT), scales pre-training to 12T tokens, and fine-tunes on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks: iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and code: https://github.com/ML-GSAI/LLaDA.
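The abstract names the masked diffusion objective without writing it out. As a rough illustration only, the sketch below follows the standard masked diffusion formulation used by the original LLaDA: sample a masking ratio t, mask each token independently with probability t, and average the negative log-likelihood of the masked tokens with a 1/t reweighting. The names `MASK_ID`, `mask_tokens`, and `masked_diffusion_loss`, the lower bound on t, and the uniform toy "model" are all assumptions made for this example, not details taken from iLLaDA.

```python
import math
import random

MASK_ID = -1  # hypothetical id standing in for the mask token


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    masked_flags = [rng.random() < t for _ in tokens]
    noisy = [MASK_ID if m else tok for tok, m in zip(tokens, masked_flags)]
    return noisy, masked_flags


def masked_diffusion_loss(log_probs, tokens, masked_flags, t):
    """One-sample estimate of the masked diffusion loss for one sequence.

    log_probs[i][v] is the log-probability assigned to vocabulary item v at
    position i, as a bidirectional model would compute from the noisy input.
    Only masked positions contribute, and the sum is reweighted by 1/t.
    """
    total = 0.0
    for i, (tok, is_masked) in enumerate(zip(tokens, masked_flags)):
        if is_masked:
            total -= log_probs[i][tok]
    return total / (t * len(tokens))


# Toy usage with a uniform predictor in place of a real network.
rng = random.Random(0)
tokens = [3, 1, 4, 1, 5]
vocab_size = 8
t = rng.uniform(1e-3, 1.0)  # assumed lower bound, avoids division by zero
noisy, flags = mask_tokens(tokens, t, rng)
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in tokens]
loss = masked_diffusion_loss(uniform, tokens, flags, t)
```

Unlike next-token prediction, every masked position here is predicted from context on both sides, which is why no causal attention mask is needed.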