
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

June 26, 2024
Authors: William Berman, Alexander Peysakhovich
cs.AI

Abstract

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
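The abstract describes bootstrapping the training set by cropping image regions that correspond to words in a caption and interleaving those crops back into the prompt. As a rough illustration only, the Python sketch below shows one way such a training example could be assembled; the names `detect_noun_phrases`, `MultimodalExample`, and `bootstrap_example` are hypothetical placeholders, the detector is left unspecified, and none of this reflects the authors' actual pipeline.

```python
# Illustrative sketch of the crop-bootstrapping idea from the abstract.
# Assumption: some open-vocabulary detector can return a box per grounded phrase.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

from PIL import Image


@dataclass
class MultimodalExample:
    """Interleaved prompt: caption words, with an image crop attached where one was found."""
    segments: List[Tuple[str, Optional[Image.Image]]]  # (word, optional crop)
    target: Image.Image                                 # the original full image


def detect_noun_phrases(image: Image.Image, phrases: List[str]) -> Dict[str, Tuple[int, int, int, int]]:
    """Hypothetical stand-in for an open-vocabulary detector: {phrase: (left, top, right, bottom)}."""
    raise NotImplementedError("plug in a grounding/detection model of your choice")


def bootstrap_example(image: Image.Image, caption: str, phrases: List[str]) -> MultimodalExample:
    # Find boxes for the caption words we want to ground (single-word matching for simplicity).
    boxes = detect_noun_phrases(image, phrases)
    segments = []
    for word in caption.split():
        crop = image.crop(boxes[word]) if word in boxes else None
        segments.append((word, crop))
    # Note: crops and the target come from the *same* image at training time;
    # the paper reports the model still composes crops from different images at inference.
    return MultimodalExample(segments=segments, target=image)
```

This only conveys the data format implied by prompts like "a <picture of a man> man and his <picture of a dog> dog"; how MUMU's vision-language encoder consumes the interleaved segments and conditions the diffusion decoder is not detailed in the abstract.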