aMUSEd: An Open MUSE Reproduction

Abstract

We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and by releasing reproducible training code. We also release checkpoints for two models that directly produce images at 256x256 and 512x512 resolutions.
