TextLDM: Language Modeling with Continuous Latent Diffusion

Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
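
The two training signals named above, flow matching in the latent space and REPA-style alignment with a frozen language model, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the straight-line path with noise at t=0 and the latent at t=1, the cosine-similarity form of the alignment term, and every function name below are hypothetical choices made for this example. Latents are plain Python lists standing in for the VAE outputs.

```python
import math


def interpolate(noise, latent, t):
    """Point on the straight path from noise (t=0) to the latent (t=1).

    Assumed convention; the source does not state which path is used.
    """
    return [(1.0 - t) * n + t * z for n, z in zip(noise, latent)]


def target_velocity(noise, latent):
    """Velocity of the straight path: d x_t / dt = latent - noise."""
    return [z - n for n, z in zip(noise, latent)]


def flow_matching_loss(predicted_velocity, noise, latent):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(noise, latent)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


def alignment_loss(projected_feature, frozen_lm_feature):
    """Alignment term in an assumed form: 1 - cosine similarity.

    Zero when the projected latent feature points in the same direction
    as the frozen language model feature.
    """
    dot = sum(a * b for a, b in zip(projected_feature, frozen_lm_feature))
    norm_a = math.sqrt(sum(a * a for a in projected_feature))
    norm_b = math.sqrt(sum(b * b for b in frozen_lm_feature))
    return 1.0 - dot / (norm_a * norm_b)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.5, -1.0, 2.0]
    latent = [1.0, 0.0, -1.0]
    # An oracle that returns the exact path velocity stands in for the DiT.
    oracle = target_velocity(noise, latent)
    sample = euler_sample(lambda x, t: oracle, noise, steps=10)
    print(sample)
    print(flow_matching_loss(oracle, noise, latent))
```

In a real system the oracle would be replaced by the DiT's velocity prediction given the noisy latent and the time step, and sampled latents would be passed through the VAE decoder to recover tokens. The alignment term would be applied to VAE features during VAE training, with the pretrained language model kept frozen.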