MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
June 26, 2024
Authors: William Berman, Alexander Peysakhovich
cs.AI
Abstract
We train a model to generate images from multimodal prompts of interleaved
text and images such as "a <picture of a man> man and his <picture of a dog>
dog in an <picture of a cartoon> animated style." We bootstrap a multimodal
dataset by extracting semantically meaningful image crops corresponding to
words in the image captions of synthetically generated and publicly available
text-image data. Our model, MUMU, is composed of a vision-language model
encoder with a diffusion decoder and is trained on a single 8xH100 GPU node.
Despite being only trained on crops from the same image, MUMU learns to compose
inputs from different images into a coherent output. For example, an input of a
realistic person and a cartoon will output the same person in the cartoon
style, and an input of a standing subject and a scooter will output the subject
riding the scooter. As a result, our model generalizes to tasks such as style
transfer and character consistency. Our results show the promise of using
multimodal models as general purpose controllers for image generation.
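The dataset-bootstrapping step described above can be illustrated with a minimal sketch. Assume an open-vocabulary detector (not shown here) has already grounded certain caption words to bounding boxes; the sketch below then replaces each grounded word with an (image crop, word) pair, producing the kind of interleaved text-and-image prompt the abstract describes. The function names and the (top, left, bottom, right) box convention are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of MUMU-style dataset bootstrapping.
# Assumption: word-level bounding boxes come from some external
# open-vocabulary detector; here they are supplied directly.

def crop(image, box):
    """Crop a 2D image (list of rows) to box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def build_multimodal_prompt(caption, word_boxes, image):
    """Interleave caption words with image crops for grounded words.

    word_boxes: dict mapping a caption word to its bounding box.
    Returns a list of tokens: plain strings, or (crop, word) tuples.
    """
    prompt = []
    for word in caption.split():
        if word in word_boxes:
            # Grounded word: pair the semantically meaningful crop with it.
            prompt.append((crop(image, word_boxes[word]), word))
        else:
            prompt.append(word)
    return prompt

# Toy example: a 4x4 "image", with "dog" grounded to its top-left quadrant.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
prompt = build_multimodal_prompt("a man and his dog",
                                 {"dog": (0, 0, 2, 2)}, image)
```

A real pipeline would operate on pixel arrays and feed the interleaved sequence to the vision-language encoder; this sketch only shows the interleaving structure.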