One-step Language Modeling via Continuous Denoising

Abstract

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross-entropy objective, and we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
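As a rough illustration of the clean-data prediction objective mentioned in the abstract, the sketch below noises one-hot token encodings in Euclidean space and scores a prediction of the clean token with cross-entropy. It is not the authors' implementation: the linear interpolant between Gaussian noise and the one-hot vector, the function names, and the `toy_denoiser` stand-in for a neural network are all assumptions made for illustration, and the time reparameterization and flow map distillation are not modeled.

```python
import math
import random


def one_hot(token, vocab_size):
    """Encode a token id as a one-hot vector in R^vocab_size."""
    return [1.0 if i == token else 0.0 for i in range(vocab_size)]


def interpolate(x0, x1, t):
    """Assumed linear interpolant x_t = (1 - t) * x0 + t * x1 (noise -> data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def toy_denoiser(x_t, scale=5.0):
    """Hypothetical stand-in for a learned network: logits proportional to x_t."""
    return [scale * v for v in x_t]


def clean_prediction_loss(tokens, vocab_size, t, rng):
    """Mean cross-entropy between the predicted clean-token distribution and the data."""
    total = 0.0
    for tok in tokens:
        x1 = one_hot(tok, vocab_size)                        # clean data
        x0 = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]  # Gaussian noise
        x_t = interpolate(x0, x1, t)                         # noisy point at time t
        log_probs = log_softmax(toy_denoiser(x_t))
        total -= log_probs[tok]                              # cross-entropy term
    return total / len(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [3, 1, 4, 1]
    for t in (0.0, 0.5, 1.0):
        print(t, clean_prediction_loss(tokens, vocab_size=8, t=t, rng=rng))
```

In a real model the toy denoiser would be replaced by a trained network that takes the noisy sequence and the time as input; the sketch only shows how a continuous noising path over one-hot vectors can be paired with a categorical cross-entropy loss.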
