LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Abstract

Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation for many data modalities, such as images. In language modeling, however, prior continuous diffusion language models (DLMs) lag behind their discrete counterparts because of the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. LangFlow connects embedding-space DLMs to Flow Matching via Bregman divergence and adds three key innovations: (1) we derive a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) we propose an information-uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self-conditioning, which we find improves both the likelihood and the sample quality of embedding-space DLMs, with effects substantially different from those in discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both perplexity (PPL) and generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero-shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow
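As a reading aid for the reported numbers: perplexity is the exponential of the average per-token negative log-likelihood, so if a bound upper-bounds the NLL, it also upper-bounds the PPL. The following is a minimal sketch of that conversion, assuming the NLL is summed over tokens and measured in nats. The function name and the example values are hypothetical and are not taken from the LangFlow code.

```python
import math


def perplexity_from_nll(total_nll_nats: float, num_tokens: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to per-token perplexity."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # PPL = exp(average NLL per token)
    return math.exp(total_nll_nats / num_tokens)


# Hypothetical example: an NLL upper bound summed over 1000 tokens.
# Because exp is monotonic, an upper bound on NLL yields an upper bound on PPL.
nll_upper_bound = 1000 * math.log(30.0)
ppl_upper_bound = perplexity_from_nll(nll_upper_bound, 1000)
```

If the NLL were reported in bits instead of nats, the exponential would be taken base 2 rather than base e.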
