
TextLDM: Language Modeling with Continuous Latent Diffusion

May 8, 2026
Authors: Jiaxiu Jiang, Jingjing Ren, Wenbo Li, Bo Wang, Haoze Sun, Yijun Yang, Jianhui Liu, Yanbing Zhang, Shenghe Zheng, Yuan Zhang, Haoyang Huang, Nan Duan, Wangmeng Zuo
cs.AI

Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. Applying this framework to language modeling is a natural next step toward a single architecture that supports both visual synthesis and text generation. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents and is enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT, identical in architecture to its visual counterpart, then performs flow matching in this latent space. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, a concrete step toward unified diffusion architectures for multimodal generation and understanding.
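As a rough illustration of the two training signals the abstract describes (flow matching in the VAE latent space, and REPA alignment of VAE features with a frozen pretrained language model), here is a minimal PyTorch sketch. All names (dit, proj, the feature shapes) are assumptions for illustration, not the paper's actual interfaces:

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(dit, z, cond=None):
        # z: clean latents from the text VAE encoder, shape (B, L, D)
        b = z.size(0)
        t = torch.rand(b, device=z.device).view(b, 1, 1)  # per-sample time in [0, 1]
        eps = torch.randn_like(z)                          # Gaussian noise endpoint
        z_t = (1.0 - t) * eps + t * z                      # linear interpolant between noise and data
        v_target = z - eps                                 # velocity of the straight-line path
        v_pred = dit(z_t, t.view(b), cond)                 # DiT predicts the velocity field
        return F.mse_loss(v_pred, v_target)

    def repa_loss(vae_features, lm_hidden, proj):
        # vae_features: intermediate text-VAE features, shape (B, L, D_vae)
        # lm_hidden: hidden states of the frozen pretrained LM on the same tokens, (B, L, D_lm)
        # proj: learned projection from D_vae to D_lm (hypothetical alignment head)
        h = F.normalize(proj(vae_features), dim=-1)
        g = F.normalize(lm_hidden.detach(), dim=-1)  # LM is frozen: no gradients flow into it
        return -(h * g).sum(dim=-1).mean()           # maximize token-wise cosine similarity

Under this reading, the VAE would be trained with a reconstruction loss plus the REPA term, while the DiT is trained purely on the flow-matching objective over the resulting latents; the abstract does not specify the loss weighting or which layer's features are aligned.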