Unified Multimodal Discrete Diffusion

Abstract

Multimodal generative models that can understand and generate across multiple modalities are currently dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right or from top to bottom. These models jointly handle images, text, video, and audio for tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over the trade-off between quality and diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, perform a scaling analysis, and demonstrate that UniDisc outperforms them in performance and inference-time compute efficiency, controllability, editability, inpainting, and the flexibility of trading off inference time against generation quality. Code and additional visualizations are available at https://unidisc.github.io.
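To make the idea of joint inpainting and the inference time versus quality trade-off more concrete, the following is a minimal toy sketch, not the UniDisc implementation. It assumes a masking (absorbing-state) form of discrete diffusion and a confidence-ordered unmasking schedule, neither of which is specified in the abstract. The names `MASK`, `toy_denoiser`, and `inpaint` are hypothetical, and the "denoiser" is a random stand-in for a trained network.

```python
import random

MASK = -1  # hypothetical id for the absorbing "mask" token (an assumption)


def toy_denoiser(tokens, vocab_size, rng):
    """Random stand-in for a trained network.

    Returns one (proposed_token, confidence) pair per position.
    A real model would condition its proposals on `tokens`.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def inpaint(tokens, vocab_size, num_steps, seed=0):
    """Fill every MASK position in `tokens` within `num_steps` parallel steps.

    Positions that are not MASK (text or image tokens alike) are never
    changed, so known content in either modality conditions the rest.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    rng = random.Random(seed)
    tokens = list(tokens)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, vocab_size, rng)
        # Commit an even share of the remaining masked positions per step,
        # most confident first; the final step commits whatever is left.
        steps_left = num_steps - step
        n_commit = -(-len(masked) // steps_left)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


# Treat the first four positions as "text" tokens and the last four as
# "image" tokens; gaps in both halves are filled in the same loop.
sequence = [5, MASK, 7, MASK, MASK, 2, MASK, 9]
filled = inpaint(sequence, vocab_size=16, num_steps=3)
```

In this sketch, `num_steps` is the knob that trades inference time against quality: fewer steps commit more tokens at once, while more steps let each proposal see more already-committed context. An AR decoder, by contrast, always needs one step per token in a fixed order.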