

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

October 23, 2024
作者: Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, Lingpeng Kong
cs.AI

Abstract

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions: https://github.com/HKUNLP/DiffuLLaMA.
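The abstract does not spell out the continual pre-training objective. Below is a minimal sketch of a masked (absorbing-state) discrete-diffusion loss of the kind commonly used when adapting an AR checkpoint to diffusion-style text generation: a random fraction of tokens is replaced by a mask token and the model is trained to recover them, with a 1/t reweighting as in standard discrete-diffusion ELBO formulations. All names here (MASK_ID, diffusion_lm_loss, the toy model) are illustrative assumptions, not identifiers from the DiffuLLaMA codebase, and this is not claimed to be the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token (assumption, not from the paper)

def diffusion_lm_loss(model, input_ids):
    """One sketch of a continual pre-training step: corrupt a random
    fraction of tokens with [MASK] and train the model to recover them.
    In an adapted AR model, `model` would run with full (bidirectional)
    attention instead of a causal mask."""
    batch, seq_len = input_ids.shape
    # Sample a corruption level t in (0, 1] per sequence.
    t = torch.rand(batch, 1, device=input_ids.device).clamp(min=1e-3)
    # Mask each token independently with probability t.
    mask = torch.rand(batch, seq_len, device=input_ids.device) < t
    noisy_ids = torch.where(mask, torch.full_like(input_ids, MASK_ID), input_ids)
    logits = model(noisy_ids)  # assumed shape: [batch, seq_len, vocab]
    # Cross-entropy only on masked positions, reweighted by 1/t.
    loss = F.cross_entropy(logits[mask], input_ids[mask], reduction="none")
    weights = (1.0 / t).expand(batch, seq_len)[mask]
    return (weights * loss).sum() / mask.sum().clamp(min=1)

if __name__ == "__main__":
    # Toy stand-in for a language model, just so the snippet runs end to end.
    vocab, dim = 100, 32
    toy = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                              torch.nn.Linear(dim, vocab))
    ids = torch.randint(1, vocab, (2, 16))  # keep MASK_ID out of the data
    print(diffusion_lm_loss(toy, ids))
```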

