

aMUSEd: An Open MUSE Reproduction

January 3, 2024
Authors: Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
cs.AI

Abstract

We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.
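As a rough illustration of how one of the released checkpoints might be used, the sketch below loads the 256x256 model for text-to-image generation through the diffusers AmusedPipeline. The repository id "amused/amused-256", the specific pipeline class, and the step count are assumptions not stated in the abstract; consult the released training code and model cards for the exact names.

```python
# Minimal text-to-image sketch for the released 256x256 aMUSEd checkpoint.
# The "amused/amused-256" repository id and the AmusedPipeline class are
# assumptions; check the released code/model cards for the exact identifiers.
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256",  # assumed Hugging Face checkpoint id
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Masked image modeling needs comparatively few inference steps;
# a small step count such as 12 is used here for illustration.
image = pipe(
    "a cat wearing a sailor hat, watercolor",
    num_inference_steps=12,
).images[0]
image.save("amused_256_sample.png")
```

The same pattern would apply to the 512x512 checkpoint by swapping in its repository id; only the output resolution changes.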