Image Editing As Programs with Diffusion Models

Abstract

While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models struggle in particular with structurally inconsistent edits that involve substantial layout changes. To bridge this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP treats instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter that shares the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Code is available at https://github.com/YujiaHu1109/IEAP.
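The idea of treating an edit as a program can be illustrated with a small, purely hypothetical sketch. Nothing below is taken from the IEAP codebase: the operation names (`remove`, `add`, `recolor`), the `Op` type, the dictionary-based "image", and the hand-written program are all assumptions made for illustration. In the framework described above, each operation would presumably be a DiT adapter and the program would be produced by the VLM-based agent; here both are replaced by plain Python stand-ins so that only the decompose-then-execute control flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for an image: object name -> attributes.
# (Assumption for illustration; a real system operates on pixels or latents.)
Image = Dict[str, Dict[str, str]]


@dataclass(frozen=True)
class Op:
    """One atomic operation in an edit program (hypothetical schema)."""
    name: str      # which atomic operation to run
    target: str    # which object the operation applies to
    arg: str = ""  # optional parameter, e.g. a color


def remove_object(img: Image, op: Op) -> Image:
    # Return a new scene without the target object.
    return {k: v for k, v in img.items() if k != op.target}


def add_object(img: Image, op: Op) -> Image:
    # Return a new scene with the target object added.
    out = dict(img)
    out[op.target] = {"color": op.arg or "default"}
    return out


def recolor(img: Image, op: Op) -> Image:
    # Copy inner dicts so the input scene is never mutated.
    out = {k: dict(v) for k, v in img.items()}
    if op.target in out:
        out[op.target]["color"] = op.arg
    return out


# Registry of atomic operations. In the paper's setting each entry would
# correspond to a specialized adapter on a shared backbone (assumption).
REGISTRY: Dict[str, Callable[[Image, Op], Image]] = {
    "remove": remove_object,
    "add": add_object,
    "recolor": recolor,
}


def run_program(img: Image, program: List[Op]) -> Image:
    """Apply the operations in order, feeding each result to the next."""
    for op in program:
        handler = REGISTRY.get(op.name)
        if handler is None:
            raise ValueError(f"unknown operation: {op.name}")
        img = handler(img, op)
    return img


if __name__ == "__main__":
    scene = {"cat": {"color": "black"}, "ball": {"color": "red"}}
    # Hand-written decomposition of: "replace the ball with a brown dog
    # and make the cat white". A VLM agent would emit this list instead.
    program = [
        Op("remove", "ball"),
        Op("add", "dog", "brown"),
        Op("recolor", "cat", "white"),
    ]
    print(run_program(scene, program))
```

The point of the sketch is the design choice rather than the operations themselves: because every step has the same signature, a complex instruction reduces to an ordered list that a planner can emit and an executor can run without knowing what any individual step does.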
