ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

August 2, 2023
Authors: Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike
cs.AI

Abstract

While language-guided image manipulation has made remarkable progress, the challenge of instructing the manipulation process to faithfully reflect human intentions persists. An accurate and comprehensive description of a manipulation task in natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity of linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be eliminated entirely. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intention solely from visual demonstrations and then applying that operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where contextual information is fully exploited through an iterative generation process. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover the human intent behind the visual instructions. Extensive experiments show that our method generates compelling manipulation results that conform to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization on various downstream tasks such as pose transfer, image translation, and video inpainting.
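
To make the inpainting formulation concrete, below is a minimal sketch of the grid-as-inpainting idea: the exemplar pair (E, E') and the query image I are tiled into a 2×2 canvas whose fourth quadrant is masked and filled by a diffusion inpainting model. This is an illustration under stated assumptions, not the paper's implementation: ImageBrush conditions generation on its own visual prompting encoder, whereas this sketch substitutes the off-the-shelf diffusers StableDiffusionInpaintPipeline (text-conditioned, called here with an empty prompt). The helper make_grid_and_mask, the 256-pixel quadrant size, and the file names are hypothetical.

```python
# Sketch of visual in-context editing as inpainting (assumptions noted above;
# the pretrained text-conditioned inpainting pipeline is a stand-in for the
# paper's visual-prompt-conditioned diffusion model).
from PIL import Image
import torch
from diffusers import StableDiffusionInpaintPipeline

S = 256  # side length of each quadrant (hypothetical choice; grid is 512x512)

def make_grid_and_mask(example_src, example_edited, query):
    """Tile [E, E'; I, ?] into one canvas and mask the unknown quadrant."""
    canvas = Image.new("RGB", (2 * S, 2 * S))
    canvas.paste(example_src.resize((S, S)), (0, 0))     # top-left: E
    canvas.paste(example_edited.resize((S, S)), (S, 0))  # top-right: E'
    canvas.paste(query.resize((S, S)), (0, S))           # bottom-left: I
    mask = Image.new("L", (2 * S, 2 * S), 0)             # 0 = keep pixels
    mask.paste(255, (S, S, 2 * S, 2 * S))                # 255 = inpaint I'
    return canvas, mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

example_src = Image.open("example_before.png")    # E  (hypothetical file)
example_edited = Image.open("example_after.png")  # E' (hypothetical file)
query = Image.open("query.png")                   # I  (hypothetical file)

canvas, mask = make_grid_and_mask(example_src, example_edited, query)
result = pipe(prompt="", image=canvas, mask_image=mask,
              height=2 * S, width=2 * S).images[0]
# Crop out the generated quadrant I'.
result.crop((S, S, 2 * S, 2 * S)).save("query_edited.png")
```

The final crop extracts the generated quadrant I'; the point of the arrangement is that a single inpainting pass can "read" the demonstrated transformation from the surrounding context rather than from a text prompt.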