Instruct-Imagen: Image Generation with Multi-modal Instruction

January 3, 2024
作者: Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia
cs.AI

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's ability to ground its generation in external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
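
To make the "uniform format" idea concrete: a multi-modal instruction can be pictured as natural-language text whose placeholders are bound to non-text payloads (edge maps, style references, subject photos, etc.). The sketch below is a minimal illustration of that idea only; the `MultiModalInstruction` class, its field names, and the `render` helper are hypothetical and are not the authors' actual data format or API.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical sketch of a "multi-modal instruction": instruction text with
# named placeholders, each bound to a non-text modality (here, image paths).
# None of these names come from the paper.
@dataclass
class MultiModalInstruction:
    # Natural-language intent with placeholders such as [subject] or [style].
    text: str
    # Binds each placeholder to a modality payload, e.g. an image file path.
    contexts: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Show the instruction text alongside its modality bindings."""
        bindings = ", ".join(f"[{k}] -> {v}" for k, v in self.contexts.items())
        return f"{self.text}  ({bindings})"

# A subject-driven generation intent expressed in this uniform format:
instruction = MultiModalInstruction(
    text="Render the dog shown in [subject] hiking on a mountain trail,"
         " in the painting style of [style].",
    contexts={"subject": "dog_photo.png", "style": "style_ref.png"},
)
print(instruction.render())
```

Under this reading, tasks as different as edge-conditioned generation, style transfer, and subject-driven generation all reduce to the same text-plus-bindings structure, which is what lets a single model be instruction-tuned across them.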