MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
June 26, 2024
Authors: William Berman, Alexander Peysakhovich
cs.AI
Abstract
We train a model to generate images from multimodal prompts of interleaved
text and images such as "a <picture of a man> man and his <picture of a dog>
dog in an <picture of a cartoon> animated style." We bootstrap a multimodal
dataset by extracting semantically meaningful image crops corresponding to
words in the image captions of synthetically generated and publicly available
text-image data. Our model, MUMU, is composed of a vision-language model
encoder with a diffusion decoder and is trained on a single 8xH100 GPU node.
Despite being only trained on crops from the same image, MUMU learns to compose
inputs from different images into a coherent output. For example, an input of a
realistic person and a cartoon will output the same person in the cartoon
style, and an input of a standing subject and a scooter will output the subject
riding the scooter. As a result, our model generalizes to tasks such as style
transfer and character consistency. Our results show the promise of using
multimodal models as general purpose controllers for image generation.
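The dataset-bootstrapping step described above can be illustrated with a minimal sketch. Assume an open-vocabulary detector (not shown here) has already grounded certain caption words to bounding boxes; the sketch below then replaces each grounded word with an (image crop, word) pair, producing the kind of interleaved text-and-image prompt the abstract describes. The function names and the (top, left, bottom, right) box convention are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of MUMU-style dataset bootstrapping.
# Assumption: word-level bounding boxes come from some external
# open-vocabulary detector; here they are supplied directly.

def crop(image, box):
    """Crop a 2D image (list of rows) to box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def build_multimodal_prompt(caption, word_boxes, image):
    """Interleave caption words with image crops for grounded words.

    word_boxes: dict mapping a caption word to its bounding box.
    Returns a list of tokens: plain strings, or (crop, word) tuples.
    """
    prompt = []
    for word in caption.split():
        if word in word_boxes:
            # Grounded word: pair the semantically meaningful crop with it.
            prompt.append((crop(image, word_boxes[word]), word))
        else:
            prompt.append(word)
    return prompt

# Toy example: a 4x4 "image", with "dog" grounded to its top-left quadrant.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
prompt = build_multimodal_prompt("a man and his dog",
                                 {"dog": (0, 0, 2, 2)}, image)
```

A real pipeline would operate on pixel arrays and feed the interleaved sequence to the vision-language encoder; this sketch only shows the interleaving structure.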