Continuous Latent Diffusion Language Model

Abstract

Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
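To make the term "block-causal" concrete, the following is a minimal sketch of a block-causal attention mask in plain Python. It assumes the common reading of the term: latents inside one block attend to each other bidirectionally, and every block also attends to all earlier blocks. The abstract does not specify the block size or the exact masking rule used in Cola DLM, so the function name `block_causal_mask` and both parameters are hypothetical and purely illustrative.

```python
from typing import List


def block_causal_mask(num_latents: int, block_size: int) -> List[List[bool]]:
    """Build an illustrative block-causal mask (an assumption, not the paper's code).

    mask[i][j] is True when latent position i may attend to latent position j.
    Position i sees every position whose block index is less than or equal to
    its own, so attention is bidirectional inside a block and causal across blocks.
    """
    if num_latents < 0:
        raise ValueError("num_latents must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    mask = []
    for i in range(num_latents):
        block_i = i // block_size  # block that query position i belongs to
        row = [(j // block_size) <= block_i for j in range(num_latents)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Visualize a mask for 6 latent positions grouped into blocks of 2.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to an ordinary causal (lower-triangular) mask, and with a block size equal to the sequence length it becomes fully bidirectional, which is one way to see block-causal attention as interpolating between the two regimes.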
