Improved Large Language Diffusion Models

Abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B-parameter masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout both pre-training and supervised fine-tuning (SFT): pre-training is scaled to 12T tokens, and fine-tuning runs for 12 epochs on a 25B-token instruction corpus. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks. For example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and code are available at https://github.com/ML-GSAI/LLaDA.
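As a rough illustration of the masked diffusion objective named above, the sketch below follows the loss as formulated for the original LLaDA: sample a masking ratio t, mask each token independently with probability t, and sum the negative log-likelihood of the original tokens at masked positions, weighted by 1/t. That iLLaDA uses exactly this form is an assumption, not something the abstract states. The names `MASK`, `forward_mask`, `masked_diffusion_loss`, and `uniform_log_prob` are hypothetical, and the uniform predictor is only a stand-in for a bidirectional Transformer.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol, not the model's actual token


def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noised, t, log_prob_fn):
    """Single-sample loss estimate: -(1/t) * sum of log p(clean token | noised sequence)
    over the masked positions only."""
    total = 0.0
    for pos, (clean, cur) in enumerate(zip(tokens, noised)):
        if cur == MASK:
            total += log_prob_fn(noised, pos, clean)
    return -total / t


def uniform_log_prob(noised, pos, target, vocab_size=4):
    """Toy stand-in for the model: assigns uniform probability to every vocabulary item."""
    return -math.log(vocab_size)


rng = random.Random(0)
tokens = ["a", "b", "c", "d"]
t = 0.5  # in training, t would be drawn uniformly from (0, 1]
noised = forward_mask(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noised, t, uniform_log_prob)
```

Because the predictor sees the whole noised sequence at once, nothing in this loss requires a left-to-right ordering, which is what allows fully bidirectional attention.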