

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

April 15, 2026
Authors: Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
cs.AI

Abstract

Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation of many data modalities such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts due to the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion, by connecting embedding-space DLMs to Flow Matching via Bregman divergence, alongside three key innovations: (1) we derive a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) we propose an information-uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self-conditioning, as we find it improves both likelihood and sample quality of embedding-space DLMs with effects substantially different from discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both the perplexity (PPL) and the generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero-shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow
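The abstract mentions a learnable noise scheduler based on a Gumbel distribution. As a rough illustration of the general idea only (the paper's actual parameterization is not given here), the sketch below maps diffusion time to a noise level through a Gumbel CDF; the location and scale parameters `mu` and `beta` are hypothetical stand-ins for whatever the model would learn.

```python
import math

def gumbel_noise_level(t: float, mu: float = 0.0, beta: float = 1.0) -> float:
    """Map diffusion time t in [0, 1] to a noise level in (0, 1)
    via the Gumbel CDF F(x) = exp(-exp(-(x - mu) / beta)).

    `mu` (location) and `beta` (scale, > 0) are illustrative
    learnable parameters, not values from the paper.
    """
    return math.exp(-math.exp(-(t - mu) / beta))

# The schedule is monotone increasing in t, so later timesteps
# always carry at least as much noise as earlier ones.
levels = [gumbel_noise_level(t / 10) for t in range(11)]
```

Because the Gumbel CDF is strictly increasing and smooth, a scheduler of this shape stays monotone for any learned `mu` and `beta > 0`, which is the property a noise schedule needs regardless of the exact parameterization.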