Continuous Diffusion Model for Language Modeling

Abstract

Diffusion models have emerged as a promising alternative to autoregressive models for modeling discrete categorical data. Yet diffusion models that operate directly on the discrete data space do not fully exploit the power of iterative refinement, because signal is lost during transitions between discrete states. Existing continuous diffusion models for discrete data perform worse than discrete approaches, and the unclear link between the two families restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between discrete diffusion and continuous flow on the statistical manifold and, building on this analogy, introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, along with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.
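The abstract does not spell out what "the geometry of the underlying categorical distribution" looks like in practice. As a rough illustration only, the sketch below assumes the standard Fisher-Rao construction, in which the square-root map sends a categorical distribution to a point on the unit sphere and geodesics become great-circle arcs. The function names (`sqrt_map`, `fisher_rao_distance`, `geodesic`) are hypothetical and are not taken from the paper or its repository; the paper's actual parameterization may differ.

```python
import math

def sqrt_map(p):
    """Map a categorical distribution p (non-negative, sums to 1) onto the unit sphere."""
    return [math.sqrt(x) for x in p]

def fisher_rao_distance(p, q):
    """Geodesic distance between p and q under the Fisher-Rao metric (assumed geometry)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to guard against floating-point drift outside [0, 1].
    return 2.0 * math.acos(min(1.0, max(0.0, inner)))

def geodesic(p, q, t):
    """Point at fraction t in [0, 1] along the great-circle path from p to q."""
    u, v = sqrt_map(p), sqrt_map(q)
    cos_omega = sum(a * b for a, b in zip(u, v))
    omega = math.acos(min(1.0, max(0.0, cos_omega)))
    if omega < 1e-12:
        return list(p)  # p and q coincide; nothing to interpolate
    s = math.sin(omega)
    # Spherical linear interpolation between the two sphere points.
    w = [(math.sin((1.0 - t) * omega) * a + math.sin(t * omega) * b) / s
         for a, b in zip(u, v)]
    # Squaring the coordinates maps the sphere point back to the simplex.
    return [x * x for x in w]

# Two distinct tokens in a 3-word vocabulary, written as one-hot distributions.
token_a = [1.0, 0.0, 0.0]
token_b = [0.0, 1.0, 0.0]
halfway = geodesic(token_a, token_b, 0.5)
```

Under this assumed geometry, intermediate points such as `halfway` remain valid probability distributions over tokens, which is one way to picture a continuous path between discrete states.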
