One-step Language Modeling via Continuous Denoising
February 18, 2026
Authors: Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
cs.AI
Abstract
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
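The training recipe the abstract describes (Euclidean denoising over one-hot token encodings, with the model trained to predict the clean data via cross entropy) might be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the toy network, dimensions, linear interpolation path, and all function names are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 50  # hypothetical vocabulary size for the toy example

class ToyDenoiser(nn.Module):
    """Maps a noisy continuous state x_t (and time t) to clean-token logits."""
    def __init__(self, vocab=V, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab + 1, hidden), nn.GELU(), nn.Linear(hidden, vocab)
        )

    def forward(self, x_t, t):
        # Append time as an extra feature at every token position.
        t_feat = t.expand(*x_t.shape[:-1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def flm_style_loss(model, tokens):
    """Sample x_t on a straight path between Gaussian noise and the one-hot
    encoding of the clean tokens, then score the model's clean-data
    prediction with cross entropy (an assumed simple flow interpolant)."""
    x1 = F.one_hot(tokens, V).float()      # clean one-hot encodings
    x0 = torch.randn_like(x1)              # continuous noise endpoint
    t = torch.rand(tokens.shape[0], 1, 1)  # per-sequence time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # Euclidean (linear) interpolant
    logits = model(x_t, t)
    return F.cross_entropy(logits.reshape(-1, V), tokens.reshape(-1))

tokens = torch.randint(0, V, (4, 8))       # a toy batch of token sequences
loss = flm_style_loss(ToyDenoiser(), tokens)
```

The key point the sketch illustrates is that the denoiser's output is a categorical distribution over the vocabulary, so the standard cross-entropy objective applies even though the noising process is continuous; the paper's time reparameterization and flow-map distillation are omitted here.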