Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
March 12, 2025
Authors: Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
cs.AI
Abstract
Diffusion language models offer unique benefits over autoregressive models
due to their potential for parallelized generation and controllability, yet
they lag in likelihood modeling and are limited to fixed-length generation. In
this work, we introduce a class of block diffusion language models that
interpolate between discrete denoising diffusion and autoregressive models.
Block diffusion overcomes key limitations of both approaches by supporting
flexible-length generation and improving inference efficiency with KV caching
and parallel token sampling. We propose a recipe for building effective block
diffusion models that includes an efficient training algorithm, estimators of
gradient variance, and data-driven noise schedules to minimize the variance.
Block diffusion sets a new state of the art among diffusion models
on language modeling benchmarks and enables generation of arbitrary-length
sequences. We provide the code, along with the model weights and blog post on
the project page: https://m-arriola.com/bd3lms/
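To make the block-autoregressive, within-block-parallel structure described above concrete, the following is a minimal sketch of block diffusion sampling: blocks are generated left to right, while tokens inside each block are denoised in parallel conditioned on the previously generated context. The denoiser interface (`denoise_fn`), mask-token convention, block size, and unmasking schedule here are illustrative assumptions for exposition, not the released BD3-LM implementation.

```python
# Conceptual sketch only: a block-autoregressive sampler with parallel
# within-block denoising. All names and the schedule are assumptions.
import torch

MASK_ID = 0  # hypothetical id of the absorbing (mask) token


def sample_block_diffusion(denoise_fn, num_blocks=4, block_size=16, num_steps=8):
    """Generate a sequence block by block.

    Blocks are produced autoregressively (left to right); tokens inside a
    block are denoised in parallel, conditioned on all previous blocks.
    In a real implementation the context's keys/values would be cached
    across blocks rather than recomputed.
    """
    generated = torch.empty(0, dtype=torch.long)  # grows one block at a time
    for _ in range(num_blocks):
        # Start the new block fully masked (absorbing-state diffusion).
        block = torch.full((block_size,), MASK_ID, dtype=torch.long)
        for step in reversed(range(1, num_steps + 1)):
            t = step / num_steps  # noise level in (0, 1]
            # Predict clean tokens for the whole block in parallel,
            # conditioned on the previously generated context.
            logits = denoise_fn(block, t, generated)  # (block_size, vocab)
            probs = torch.softmax(logits, dim=-1)
            sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
            # Reveal a growing fraction of still-masked positions as the
            # noise level drops; already-revealed tokens stay fixed.
            is_masked = block == MASK_ID
            reveal = is_masked & (torch.rand(block_size) >= (step - 1) / num_steps)
            block = torch.where(reveal, sampled, block)
        generated = torch.cat([generated, block])
    return generated


if __name__ == "__main__":
    # Toy stand-in denoiser: uniform logits over a small vocabulary.
    vocab_size = 32
    dummy = lambda blk, t, ctx: torch.zeros(blk.shape[0], vocab_size)
    print(sample_block_diffusion(dummy))
```

Because each block is conditioned only on completed earlier blocks, this structure admits KV caching of the context and arbitrary-length generation (keep appending blocks), while the per-block denoising loop samples many tokens in parallel.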