DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Abstract

Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
