Image Editing As Programs with Diffusion Models
June 4, 2025
Authors: Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
cs.AI
Abstract
While diffusion models have achieved remarkable success in text-to-image
generation, they encounter significant challenges with instruction-driven image
editing. Our research highlights a key challenge: these models particularly
struggle with structurally inconsistent edits that involve substantial layout
changes. To address this gap, we introduce Image Editing As Programs (IEAP), a
unified image editing framework built upon the Diffusion Transformer (DiT)
architecture. At its core, IEAP approaches instructional editing through a
reductionist lens, decomposing complex editing instructions into sequences of
atomic operations. Each operation is implemented via a lightweight adapter
sharing the same DiT backbone and is specialized for a specific type of edit.
Programmed by a vision-language model (VLM)-based agent, these operations
collaboratively support arbitrary and structurally inconsistent
transformations. By modularizing and sequencing edits in this way, IEAP
generalizes robustly across a wide range of editing tasks, from simple
adjustments to substantial structural changes. Extensive experiments
demonstrate that IEAP significantly outperforms state-of-the-art methods on
standard benchmarks across various editing scenarios. In these evaluations, our
framework delivers superior accuracy and semantic fidelity, particularly for
complex, multi-step instructions. Code is available at
https://github.com/YujiaHu1109/IEAP.
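
The decompose-then-execute idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the IEAP codebase: the operation names (remove, add, restyle), the Image and SharedBackbone classes, and the lookup-table planner that stands in for the VLM agent are all hypothetical. No diffusion model is run; each "adapter" only records what it would have done.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Image:
    """Stand-in for an image: it only tracks the edits applied so far."""
    history: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Op:
    """One atomic operation in an editing program (names are hypothetical)."""
    name: str
    arg: str


Adapter = Callable[[Image, str], Image]


class SharedBackbone:
    """Placeholder for one backbone shared by several lightweight adapters."""

    def __init__(self) -> None:
        self.adapters: Dict[str, Adapter] = {}

    def register(self, name: str, adapter: Adapter) -> None:
        self.adapters[name] = adapter

    def run(self, image: Image, op: Op) -> Image:
        # Dispatch to the adapter specialized for this type of edit.
        if op.name not in self.adapters:
            raise KeyError(f"no adapter registered for operation {op.name!r}")
        return self.adapters[op.name](image, op.arg)


def make_adapter(name: str) -> Adapter:
    """Build a dummy adapter that logs the edit instead of generating pixels."""
    def adapter(image: Image, arg: str) -> Image:
        return Image(history=image.history + [f"{name}({arg})"])
    return adapter


def plan(instruction: str) -> List[Op]:
    """Stub planner standing in for the VLM agent: a lookup table, not a model."""
    table = {
        "move the cat to the left and make it night": [
            Op("remove", "cat"),
            Op("add", "cat on the left"),
            Op("restyle", "night"),
        ],
    }
    return table[instruction]


def execute(backbone: SharedBackbone, image: Image, program: List[Op]) -> Image:
    """Apply the atomic operations in order, feeding each result to the next."""
    for op in program:
        image = backbone.run(image, op)
    return image


backbone = SharedBackbone()
for op_name in ("remove", "add", "restyle"):
    backbone.register(op_name, make_adapter(op_name))

program = plan("move the cat to the left and make it night")
result = execute(backbone, Image(), program)
print(result.history)
```

The sketch only shows the control flow: one complex instruction becomes an ordered program, and each step is routed to a specialized module on a common backbone. How the actual framework defines its operations and adapters is documented in the repository linked above.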