

TextLDM: Language Modeling with Continuous Latent Diffusion

May 8, 2026
Authors: Jiaxiu Jiang, Jingjing Ren, Wenbo Li, Bo Wang, Haoze Sun, Yijun Yang, Jianhui Liu, Yanbing Zhang, Shenghe Zheng, Yuan Zhang, Haoyang Huang, Nan Duan, Wangmeng Zuo
cs.AI

Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
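The abstract describes a two-stage recipe: a Transformer VAE maps discrete tokens to continuous latents (with REPA aligning those latents to a frozen pretrained language model), and a standard DiT is then trained with flow matching in that latent space. The sketch below illustrates the two training objectives in minimal form, assuming a rectified-flow (linear interpolation) formulation. All module names, sizes, the toy denoiser, and the tensor standing in for the frozen LM's hidden states are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of latent flow matching with a REPA-style alignment term.
# Dimensions, modules, and the stand-in LM features are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenVAEEncoder(nn.Module):
    """Toy Transformer encoder mapping discrete tokens to continuous latents."""
    def __init__(self, vocab_size=50257, d_model=256, latent_dim=32, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar

class LatentDenoiser(nn.Module):
    """Stand-in for the DiT: predicts the velocity from noisy latents and timestep."""
    def __init__(self, latent_dim=32, d_model=256, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim + 1, d_model)  # concat scalar t per position
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, x_t, t):
        t_feat = t.view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        h = self.blocks(self.in_proj(torch.cat([x_t, t_feat], dim=-1)))
        return self.out_proj(h)

def flow_matching_loss(denoiser, z):
    """Rectified-flow objective: regress the velocity (z - noise) along the
    straight path x_t = (1 - t) * noise + t * z."""
    noise = torch.randn_like(z)
    t = torch.rand(z.size(0), device=z.device)
    x_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * z
    v_pred = denoiser(x_t, t)
    return F.mse_loss(v_pred, z - noise)

def repa_alignment_loss(z, lm_hidden, proj):
    """REPA-style alignment: maximize cosine similarity between projected latents
    and hidden states of a frozen pretrained LM (passed in here as a tensor)."""
    return 1.0 - F.cosine_similarity(proj(z), lm_hidden, dim=-1).mean()

if __name__ == "__main__":
    tokens = torch.randint(0, 50257, (4, 64))        # dummy token batch
    vae, dit = TokenVAEEncoder(), LatentDenoiser()
    proj = nn.Linear(32, 768)                        # latent -> LM feature space
    lm_hidden = torch.randn(4, 64, 768)              # stand-in for frozen LM features

    z, mu, logvar = vae(tokens)
    vae_stage_loss = repa_alignment_loss(z, lm_hidden, proj)   # plus reconstruction/KL terms
    dit_stage_loss = flow_matching_loss(dit, z.detach())        # DiT trained on fixed latents
    print(f"alignment: {vae_stage_loss.item():.4f}  flow matching: {dit_stage_loss.item():.4f}")
```

In this sketch the alignment term is attached to the VAE stage, reflecting the abstract's claim that reconstruction fidelity alone is insufficient and that REPA shapes latents suitable for conditional denoising; the reconstruction and KL terms the full VAE objective would also need are omitted for brevity.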