aMUSEd: An Open MUSE Reproduction

Abstract

We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles from only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and by releasing reproducible training code. We also release checkpoints for two models that directly produce images at 256x256 and 512x512 resolutions.
