MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Abstract

We train a model to generate images from multimodal prompts of interleaved text and images, such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being trained only on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general-purpose controllers for image generation.
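To make the idea of an interleaved prompt concrete, the following is a minimal sketch of how a caption and a set of word-level bounding boxes might be turned into an alternating sequence of text and image-crop segments. It is an illustration only, not the authors' pipeline: the names `Crop`, `build_interleaved_prompt`, and `render`, the whitespace tokenization, and the assumption that a detector has already produced a `word -> box` mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Assumed box convention: (left, upper, right, lower) in pixels,
# the same layout that Pillow's Image.crop accepts.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Crop:
    """Placeholder for an image crop tied to one caption word (hypothetical type)."""
    word: str
    box: Box


Segment = Union[str, Crop]


def build_interleaved_prompt(caption: str, detections: Dict[str, Box]) -> List[Segment]:
    """Insert a Crop segment immediately before each caption word that has a box.

    `detections` is assumed to come from some upstream detector; how the
    boxes are obtained is outside the scope of this sketch.
    """
    segments: List[Segment] = []
    text_run: List[str] = []
    for token in caption.split():
        key = token.strip(".,;:!?").lower()
        if key in detections:
            # Flush the text accumulated so far, then emit the crop.
            if text_run:
                segments.append(" ".join(text_run))
                text_run = []
            segments.append(Crop(word=key, box=detections[key]))
        text_run.append(token)
    if text_run:
        segments.append(" ".join(text_run))
    return segments


def render(segments: List[Segment]) -> str:
    """Human-readable view of the prompt, with crops shown as <picture of ...> tags."""
    return " ".join(
        f"<picture of {s.word}>" if isinstance(s, Crop) else s for s in segments
    )


if __name__ == "__main__":
    boxes = {"man": (10, 20, 110, 220), "dog": (120, 150, 200, 230)}
    prompt = build_interleaved_prompt("a man and his dog", boxes)
    print(render(prompt))  # a <picture of man> man and his <picture of dog> dog
```

In a real system each `Crop` would be replaced by the pixels cut from the source image and passed to the vision-language encoder alongside the text; here the crop is kept as a lightweight record so the sketch has no image dependencies.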
