Improved Large Language Diffusion Models

Abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B-parameter masked diffusion language model trained entirely from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout both pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks: for example, iLLaDA-Base gains 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct gains 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and code: https://github.com/ML-GSAI/LLaDA.
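
The abstract names the masked diffusion objective without spelling it out. As a rough illustration only, the sketch below assumes the common formulation in which a masking ratio t is sampled, each token is independently replaced by a mask symbol with probability t, and the loss is the cross-entropy on masked positions weighted by 1/t. The names `MASK`, `corrupt`, `masked_diffusion_loss`, and `uniform_predictor` are hypothetical, the uniform predictor is a toy stand-in for the model, and the exact normalization used by iLLaDA may differ.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the source


def corrupt(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, predict_probs):
    """Negative log-likelihood over masked positions only, weighted by 1/t."""
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            # The predictor receives the whole noisy sequence, so it may use
            # context on both sides of position i (bidirectional attention).
            probs = predict_probs(noisy, i)
            total -= math.log(probs[clean])
    return total / t


def uniform_predictor(vocab):
    """Toy stand-in for the model: a uniform distribution over the vocabulary."""
    p = 1.0 / len(vocab)
    return lambda noisy, i: {tok: p for tok in vocab}


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
vocab = sorted(set(tokens))

# Assumption: t is drawn uniformly; the lower bound of 0.1 only avoids
# dividing by a value near zero in this toy example.
t = rng.uniform(0.1, 1.0)
noisy = corrupt(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, t, uniform_predictor(vocab))
print(noisy, round(loss, 4))
```

Because unmasked tokens contribute nothing to the loss, the model is trained purely to recover masked tokens from their two-sided context, which is what distinguishes this setup from next-token prediction under causal attention.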