ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

September 30, 2024
Authors: Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, Jingren Zhou
cs.AI

Abstract

Diffusion models have emerged as a powerful generative technology and have proven applicable in a wide range of scenarios. Most existing foundational diffusion models are designed primarily for text-guided visual generation and do not support the multi-modal conditions that are essential for many visual editing tasks. This limitation prevents these foundational models from serving as a unified model for visual generation, in the way GPT-4 does for natural language processing. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To this end, we first introduce a unified condition format termed the Long-context Condition Unit (LCU) and propose a novel Transformer-based diffusion model that takes LCUs as input, enabling joint training across various generation and editing tasks. We further propose an efficient data collection approach to address the lack of available training data: paired images are acquired through synthesis-based or clustering-based pipelines and annotated with accurate textual instructions by a fine-tuned multi-modal large language model. To comprehensively evaluate our model, we establish a benchmark of manually annotated paired data spanning a variety of visual generation tasks. Extensive experimental results demonstrate the superiority of our model in the field of visual generation. Thanks to its all-in-one capabilities, our model can serve as the single backend of a multi-modal chat system that responds to any interactive request for image creation, avoiding the cumbersome pipelines typically employed in visual agents. Code and models will be available on the project page: https://ali-vilab.github.io/ace-page/.
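
The abstract does not specify how an LCU is actually structured; as a rough, hypothetical sketch of the underlying idea — packing a history of multi-modal conditions (images plus textual instructions) into one unit so that generation and editing tasks share a single input format — consider the minimal Python example below. All names (ConditionFrame, LCUnit, to_model_inputs) are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of a "Long-context Condition Unit" (LCU)-style container.
# The paper's actual format is not given in the abstract; the names and fields
# below are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class ConditionFrame:
    """One conditioning step: a textual instruction plus an optional image."""
    instruction: str                      # e.g. "remove the background"
    image: Optional[np.ndarray] = None    # H x W x C array; None for pure text-to-image


@dataclass
class LCUnit:
    """A long-context condition unit: the full history of frames for one request.

    Creation and editing requests share this single format, so one diffusion
    backbone could in principle be trained jointly on both.
    """
    frames: List[ConditionFrame] = field(default_factory=list)

    def add(self, instruction: str, image: Optional[np.ndarray] = None) -> None:
        self.frames.append(ConditionFrame(instruction, image))

    def to_model_inputs(self):
        """Flatten the unit into parallel lists a (hypothetical) model would consume."""
        texts = [f.instruction for f in self.frames]
        images = [f.image for f in self.frames]
        return texts, images


# Usage: the same unit covers text-to-image creation and a follow-up edit.
unit = LCUnit()
unit.add("a watercolor painting of a lighthouse at dusk")            # creation step
unit.add("make the sky stormy", image=np.zeros((512, 512, 3)))       # edit of a previous result
texts, images = unit.to_model_inputs()
print(texts)
```

Bundling the whole interaction history into one unit is also what would let a single model act as the backend of the multi-modal chat system described above, rather than routing each request type through a separate expert pipeline.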
