Continuous Latent Diffusion Language Model
May 7, 2026
Authors: Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu, Qiushan Guo, Feng Wang, Tao Yang, Hengshuang Zhao, Guoqiang Wei, Yan Zeng
cs.AI
Abstract
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation as hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and extends naturally to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, in which generation quality and scaling behavior may reflect model capability better than likelihood does, and they suggest a concrete path toward unified modeling across discrete text and continuous modalities.
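To make the decomposition concrete, the display below gives a minimal sketch of the factorization the abstract implies, in standard latent-variable notation; the symbols ($z$ for the Text VAE's continuous latent sequence, $z^{(b)}$ for its $b$-th block, $B$ blocks in total) are our labeling assumptions, not notation taken from the paper:

$$
p_\theta(x) \;=\; \int p_{\mathrm{dec}}(x \mid z)\, p_{\mathrm{DiT}}(z)\, \mathrm{d}z,
\qquad
p_{\mathrm{DiT}}(z) \;=\; \prod_{b=1}^{B} p_{\mathrm{DiT}}\!\bigl(z^{(b)} \mid z^{(<b)}\bigr).
$$

Here the conditional decoder $p_{\mathrm{dec}}$ handles local textual realization, while each block conditional $p_{\mathrm{DiT}}(z^{(b)} \mid z^{(<b)})$ is fit by a reverse diffusion chain running in the continuous latent space, which is what "latent prior transport rather than token-level observation recovery" amounts to under this factorization.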