Diffusion Language Models are Super Data Learners
November 5, 2025
Authors: Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh
cs.AI
Abstract
Under strictly controlled pre-training settings, we observe a Crossover: when
unique data is limited, diffusion language models (DLMs) consistently surpass
autoregressive (AR) models by training for more epochs. The crossover shifts
later with more or higher-quality data, earlier with larger models, and
persists across dense and sparse architectures. We attribute the gains to three
compounding factors: (1) any-order modeling, (2) super-dense compute from
iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation;
input or parameter noise improves AR under data constraints but cannot close the
gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B
unique Python tokens overtakes an AR coder trained with strictly matched
settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag
and > 33% on MMLU using only 1B tokens, without any special tricks, just by
repeating standard pre-training data. We also show that rising validation
cross-entropy does not imply degraded downstream performance in this regime.
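To make the three compounding factors concrete, below is a minimal PyTorch sketch of a masked-diffusion training step in the style of MDLM/LLaDA-type objectives. The abstract does not specify the paper's exact loss, so `dlm_loss`, `MASK_ID`, and the bidirectional `model` are illustrative assumptions, not the authors' implementation. Each freshly sampled mask pattern trains the model on a different conditional of the same sequence (any-order modeling), and resampling masks across epochs acts as the built-in Monte Carlo augmentation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token

def dlm_loss(model, tokens):
    """One masked-diffusion training step on a batch of token ids.

    tokens: LongTensor of shape (batch, seq_len); model: a bidirectional
    transformer returning logits of shape (batch, seq_len, vocab).
    """
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1] per sequence, then mask each
    # token independently with probability t. Every fresh mask pattern is
    # a Monte Carlo draw over "which positions to predict from which context".
    t = torch.rand(b, 1, device=tokens.device).clamp_min(1e-3)
    is_masked = torch.rand(b, n, device=tokens.device) < t
    noisy = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(noisy)  # no causal mask: every position sees both sides
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # Score only masked positions; the 1/t reweighting recovers (up to
    # normalization) the variational bound used by masked-diffusion LMs.
    loss = (ce * is_masked / t).sum() / is_masked.sum().clamp_min(1)
    return loss
```

At inference, generation proceeds by iteratively re-denoising all masked positions over many bidirectional passes, which is the source of the "super-dense compute" the abstract contrasts with a single left-to-right AR decoding pass.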