Scaling Behavior of Discrete Diffusion Language Models
December 11, 2025
Authors: Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto
cs.AI
Abstract
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and differs considerably from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data than masked diffusion for compute-efficient training, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
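For intuition on what "smoothly interpolating between masked and uniform diffusion" can look like, here is a minimal sketch of one common marginal formulation (the paper's exact parameterization may differ): each token is corrupted toward a mixture of the mask token and the uniform distribution over the vocabulary,

$$
q(x_t \mid x_0) = \alpha_t\, x_0 + (1-\alpha_t)\,\pi,
\qquad
\pi = \omega\, e_{\text{[MASK]}} + (1-\omega)\,\tfrac{1}{|V|}\,\mathbf{1},
$$

where $x_0$ is the one-hot clean token, $\alpha_t \in [0,1]$ is the noise schedule, $e_{\text{[MASK]}}$ is the one-hot mask token, $\tfrac{1}{|V|}\mathbf{1}$ is the uniform distribution over a vocabulary of size $|V|$, and the mixing weight $\omega$ sweeps from pure uniform diffusion ($\omega = 0$) to pure masked diffusion ($\omega = 1$).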