Causal Diffusion Transformers for Generative Modeling

Abstract

We introduce Causal Diffusion as the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that handles both discrete and continuous modalities and is compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization into a diffusion model substantially improves its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels. It achieves state-of-the-art results on the ImageNet generation benchmark while retaining the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulations. We hope this work provides the community with a fresh perspective on training multimodal models over discrete and continuous data.
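
The dual factorization can be pictured as two nested loops: an outer loop over sequential (AR) steps and an inner loop over diffusion noise levels. The toy sketch below only enumerates that loop order. It is not the CausalFusion implementation, and the helper `dual_schedule`, its parameters, and the contiguous equal-size token grouping are assumptions made for illustration.

```python
from typing import List, Tuple


def dual_schedule(
    num_tokens: int, num_ar_steps: int, num_noise_levels: int
) -> List[Tuple[int, int, List[int]]]:
    """Enumerate (ar_step, noise_level, token_indices) in generation order.

    Hypothetical helper: tokens are split into contiguous groups, one per
    AR step, and each group is denoised from the highest noise level to 1.
    """
    if not 1 <= num_ar_steps <= num_tokens:
        raise ValueError("num_ar_steps must be between 1 and num_tokens")
    if num_noise_levels < 1:
        raise ValueError("num_noise_levels must be at least 1")

    # Split token indices into num_ar_steps contiguous, near-equal groups.
    base, extra = divmod(num_tokens, num_ar_steps)
    groups, start = [], 0
    for s in range(num_ar_steps):
        size = base + (1 if s < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size

    schedule = []
    for s, group in enumerate(groups):  # outer loop: sequential (AR) axis
        for t in range(num_noise_levels, 0, -1):  # inner loop: noise levels
            schedule.append((s, t, group))
    return schedule


# One AR step: all tokens are denoised jointly, like a plain diffusion sampler.
diffusion_like = dual_schedule(num_tokens=4, num_ar_steps=1, num_noise_levels=3)
# One token per AR step with a single noise level: plain next-token order.
ar_like = dual_schedule(num_tokens=4, num_ar_steps=4, num_noise_levels=1)
# In between: two AR steps, each denoising two tokens over three noise levels.
mixed = dual_schedule(num_tokens=4, num_ar_steps=2, num_noise_levels=3)
```

Varying `num_ar_steps` from 1 up to `num_tokens` moves the schedule from the diffusion-like extreme to the AR-like extreme, which is one way to read the "smooth transition" between the two generation modes mentioned in the abstract.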
