
One-step Language Modeling via Continuous Denoising

February 18, 2026
Authors: Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
cs.AI

Abstract

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross-entropy objective, and we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
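
To make the training recipe in the abstract concrete, below is a minimal sketch of a cross-entropy denoising step in Python/PyTorch. It assumes a linear interpolation path between Gaussian noise and the one-hot token encodings, a uniform time distribution, and a model that takes (x_t, t) and returns vocabulary logits; none of these details, nor the paper's time reparameterization, are specified by the abstract, so treat them as illustrative assumptions rather than the authors' exact method.

import torch
import torch.nn.functional as F

def flm_training_step(model, tokens, vocab_size):
    """Sketch of one continuous-denoising training step with a
    cross-entropy objective over one-hot token encodings.

    Assumptions (not given in the abstract): linear noise-to-data
    interpolation, uniform time sampling, and a hypothetical model
    signature model(x_t, t) -> logits over the vocabulary.
    """
    B, L = tokens.shape
    x1 = F.one_hot(tokens, vocab_size).float()   # clean data: one-hot encodings
    x0 = torch.randn_like(x1)                    # Euclidean (Gaussian) noise source
    t = torch.rand(B, 1, 1)                      # per-sample time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                 # noisy point on the flow path
    logits = model(xt, t.view(B))                # predict the clean tokens
    return F.cross_entropy(logits.reshape(-1, vocab_size), tokens.reshape(-1))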
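The few-step claim rests on distilling FLM into its associated flow map, a function that transports a point at one time directly to another time along the learned flow, so that a single application carries pure noise all the way to data. The sketch below shows one-step sampling under that reading; the flow_map_model signature and its logits output are hypothetical.

import torch

@torch.no_grad()
def fmlm_one_step_sample(flow_map_model, batch_size, seq_len, vocab_size):
    """Sketch of one-step generation with a distilled flow map.

    Assumes a hypothetical flow_map_model(x, s, t) that maps a state
    at time s to time t and returns vocabulary logits at t = 1.
    """
    x0 = torch.randn(batch_size, seq_len, vocab_size)  # start from Euclidean noise
    logits = flow_map_model(x0, s=0.0, t=1.0)          # single jump across the flow
    return logits.argmax(dim=-1)                       # decode token ids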