TextLDM: Language Modeling with Continuous Latent Diffusion

Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
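The two training signals named in the abstract, flow matching in the latent space and alignment of latent features with a frozen language model, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the linear noise-to-data path, the uniform time sampling, the cosine-similarity form of the alignment term, and the names `flow_matching_pair`, `flow_matching_loss`, `repa_alignment_loss`, and `proj` are hypothetical and are not taken from the paper, whose exact objectives may differ.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one flow-matching training example from clean latents x1.

    Assumes a linear path from Gaussian noise x0 to data x1 (an
    illustrative choice; the paper's path and time sampling may differ).
    x1 has shape (batch, sequence_length, latent_dim).
    """
    x0 = rng.standard_normal(x1.shape)           # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1, 1))    # one time value per sequence
    x_t = (1.0 - t) * x0 + t * x1                # point on the straight path
    v_target = x1 - x0                           # constant velocity along it
    return x_t, t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def repa_alignment_loss(latent_feats, lm_feats, proj):
    """Negative mean cosine similarity between projected latents and LM features.

    latent_feats: (batch, sequence_length, latent_dim) VAE-side features.
    lm_feats:     (batch, sequence_length, lm_dim) features from a frozen LM.
    proj:         hypothetical (latent_dim, lm_dim) projection matrix.
    """
    z = latent_feats @ proj
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    h = lm_feats / (np.linalg.norm(lm_feats, axis=-1, keepdims=True) + 1e-8)
    return float(-np.mean(np.sum(z * h, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((2, 4, 8))     # stand-in for VAE latents
    x_t, t, v_target = flow_matching_pair(latents, rng)
    # A real model would predict the velocity from (x_t, t); zeros are a placeholder.
    print(flow_matching_loss(np.zeros_like(v_target), v_target))
    print(repa_alignment_loss(latents, latents, np.eye(8)))
```

In this sketch the denoiser would be trained to regress `v_target` from `x_t` and `t`, while the alignment term would be added to the VAE objective so that the latents carry features similar to those of the frozen language model. How the two terms are weighted is left open here because the abstract does not specify it.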