aMUSEd: An Open MUSE Reproduction
January 3, 2024
Authors: Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
cs.AI
Abstract
We present aMUSEd, an open-source, lightweight masked image model (MIM) for
text-to-image generation based on MUSE. With 10 percent of MUSE's parameters,
aMUSEd is focused on fast image generation. We believe MIM is under-explored
compared to latent diffusion, the prevailing approach for text-to-image
generation. Compared to latent diffusion, MIM requires fewer inference steps
and is more interpretable. Additionally, MIM can be fine-tuned to learn
additional styles with only a single image. We hope to encourage further
exploration of MIM by demonstrating its effectiveness on large-scale
text-to-image generation and releasing reproducible training code. We also
release checkpoints for two models which directly produce images at 256x256 and
512x512 resolutions.
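As a rough illustration of the few-step, checkpoint-based generation described above, the sketch below loads one of the released models through Hugging Face diffusers. The AmusedPipeline class name, the "amused/amused-256" model ID, and the step count are assumptions about how the public release is exposed, not details stated in the abstract.

```python
# Minimal sketch: text-to-image generation with an aMUSEd checkpoint.
# Assumptions: the release is available via diffusers as AmusedPipeline
# under the "amused/amused-256" model ID; exact names may differ.
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# MIM needs comparatively few inference steps; 12 is used here as an example.
image = pipe(
    "a photo of a red panda eating bamboo",
    num_inference_steps=12,
).images[0]
image.save("amused_sample.png")
```

Swapping the model ID for the 512x512 checkpoint (e.g. an "amused/amused-512" repository, if that is the published name) would produce higher-resolution outputs with the same call pattern.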