aMUSEd: An Open MUSE Reproduction
January 3, 2024
Authors: Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
cs.AI
Abstract
We present aMUSEd, an open-source, lightweight masked image model (MIM) for
text-to-image generation based on MUSE. With 10 percent of MUSE's parameters,
aMUSEd is focused on fast image generation. We believe MIM is under-explored
compared to latent diffusion, the prevailing approach for text-to-image
generation. Compared to latent diffusion, MIM requires fewer inference steps
and is more interpretable. Additionally, MIM can be fine-tuned to learn
additional styles with only a single image. We hope to encourage further
exploration of MIM by demonstrating its effectiveness on large-scale
text-to-image generation and releasing reproducible training code. We also
release checkpoints for two models which directly produce images at 256x256 and
512x512 resolutions.
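As a rough illustration of the few-step, checkpoint-based generation described above, the sketch below loads one of the released models through Hugging Face diffusers. The AmusedPipeline class name, the "amused/amused-256" model ID, and the step count are assumptions about how the public release is exposed, not details stated in the abstract.

```python
# Minimal sketch: text-to-image generation with an aMUSEd checkpoint.
# Assumptions: the release is available via diffusers as AmusedPipeline
# under the "amused/amused-256" model ID; exact names may differ.
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# MIM needs comparatively few inference steps; 12 is used here as an example.
image = pipe(
    "a photo of a red panda eating bamboo",
    num_inference_steps=12,
).images[0]
image.save("amused_sample.png")
```

Swapping the model ID for the 512x512 checkpoint (e.g. an "amused/amused-512" repository, if that is the published name) would produce higher-resolution outputs with the same call pattern.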