

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

August 2, 2023
作者: Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike
cs.AI

Abstract

While language-guided image manipulation has made remarkable progress, the challenge of instructing the manipulation process so that it faithfully reflects human intentions persists. Describing a manipulation task accurately and comprehensively in natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity of linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only capture human intention precisely but are also easy to obtain in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intention solely from visual demonstrations and then applying that operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where contextual information is fully exploited through an iterative generation process. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation, and video inpainting.
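To make the inpainting formulation described in the abstract concrete, here is a minimal sketch of how the visual in-context prompt could be assembled: the exemplar pair (before, after) and the query image are tiled into a grid whose remaining quadrant is masked, and a diffusion inpainting model is asked to fill it in. The function name, tile-size parameter, and grid layout below are illustrative assumptions, not the authors' API.

```python
import torch

def make_incontext_grid(example_src: torch.Tensor,
                        example_tgt: torch.Tensor,
                        query: torch.Tensor,
                        tile: int = 256):
    """Assemble a 2x2 in-context grid for diffusion-based inpainting.

    Layout (an assumed convention for this sketch):
        top row    = exemplar pair (before, after)
        bottom row = query image and a blank quadrant to be generated
    All inputs are (C, H, W) tensors already resized to tile x tile.
    Returns the composed grid and a mask marking the region to inpaint.
    """
    blank = torch.zeros_like(query)
    top = torch.cat([example_src, example_tgt], dim=2)    # (C, H, 2W)
    bottom = torch.cat([query, blank], dim=2)             # (C, H, 2W)
    grid = torch.cat([top, bottom], dim=1)                # (C, 2H, 2W)

    # Mask is 1 where the model must generate content:
    # the bottom-right quadrant, i.e. the edited version of the query.
    mask = torch.zeros(1, 2 * tile, 2 * tile)
    mask[:, tile:, tile:] = 1.0
    return grid, mask
```

Under this formulation, any pretrained diffusion inpainting pipeline that accepts an image plus a mask could be run on `grid` and `mask`; the model infers the transformation from the exemplar row and applies it analogously to the query, which is what lets the same mechanism generalize to tasks like pose transfer or image translation.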