
Instruct-Imagen: Image Generation with Multi-modal Instruction

January 3, 2024
Authors: Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia
cs.AI

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), so that rich generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's ability to ground its generation in external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
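To make the multi-modal instruction format concrete, the minimal Python sketch below models an instruction as natural-language text whose reference tags point at attached conditions (edge maps, style or subject images). The class name, the `[ref#N]` tagging scheme, and the field layout are assumptions chosen for exposition, not the paper's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical illustration (not the paper's API): a "multi-modal
# instruction" pairs a natural-language sentence with the conditioning
# inputs it mentions, so that different tasks share one uniform format.

@dataclass
class MultiModalInstruction:
    # Natural-language instruction; [ref#N] tokens refer into `contexts`.
    text: str
    # Conditioning inputs keyed by reference tag; each value is a
    # (modality, payload) pair, e.g. a style image or an edge map.
    contexts: dict[str, tuple[str, str]] = field(default_factory=dict)

# Two different generation tasks expressed in the same format:
style_task = MultiModalInstruction(
    text="Render a cabin by a lake in the style of [ref#1].",
    contexts={"ref#1": ("style_image", "path/to/style.png")},
)

subject_task = MultiModalInstruction(
    text="Generate an image of [ref#1] wearing a red scarf on a beach.",
    contexts={"ref#1": ("subject_image", "path/to/dog.png")},
)

for task in (style_task, subject_task):
    print(task.text, "->", list(task.contexts))
```

Under this framing, the two-stage recipe in the abstract amounts to first teaching the diffusion model to condition on the `contexts` entries (retrieval-augmented training), then fine-tuning it on many tasks expressed as such instructions.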