ProEdit: Inversion-based Editing From Prompts Done Right

Abstract

Inversion-based visual editing offers an effective, training-free way to edit an image or a video according to user instructions. Existing methods typically inject source-image information during sampling to keep the edit consistent with the source. However, this strategy relies too heavily on source information, which degrades the edit in the target image (e.g., failing to change a subject's attributes such as pose, number, or color as instructed). In this work, we propose ProEdit, which addresses this issue in both the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes the KV features of the source and the target in the edited region, mitigating the influence of the source image on the edited region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods such as RF-Solver, FireFlow, and UniEdit.
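The two components can be illustrated with a toy sketch. The abstract does not give the exact mixing rule or perturbation, so everything below is an assumption made for illustration: the function names `kv_mix` and `latents_shift`, the linear blend weight `delta`, the Gaussian perturbation with weight `strength`, and the per-token boolean `edit_mask` are all hypothetical and are not taken from the paper.

```python
import random


def kv_mix(source_kv, target_kv, edit_mask, delta=0.9):
    """Blend per-token key/value features (hypothetical rule).

    Tokens inside the edit mask get a linear mix of target and source
    features; tokens outside it keep the source features unchanged,
    which is what preserves the background.
    """
    mixed = []
    for src, tgt, in_edit in zip(source_kv, target_kv, edit_mask):
        if in_edit:
            # Assumed rule: delta weights the target, (1 - delta) the source.
            mixed.append([delta * t + (1.0 - delta) * s for s, t in zip(src, tgt)])
        else:
            mixed.append(list(src))
    return mixed


def latents_shift(source_latent, edit_mask, strength=0.5, seed=0):
    """Perturb the inverted latent inside the edit mask only (hypothetical rule).

    Assumed perturbation: interpolate masked tokens toward standard
    Gaussian noise; unmasked tokens are copied as-is.
    """
    rng = random.Random(seed)
    shifted = []
    for token, in_edit in zip(source_latent, edit_mask):
        if in_edit:
            shifted.append(
                [(1.0 - strength) * x + strength * rng.gauss(0.0, 1.0) for x in token]
            )
        else:
            shifted.append(list(token))
    return shifted


if __name__ == "__main__":
    # Two tokens with two feature channels each; only the second is edited.
    source = [[1.0, 1.0], [2.0, 2.0]]
    target = [[0.0, 0.0], [4.0, 4.0]]
    mask = [False, True]
    print(kv_mix(source, target, mask, delta=0.5))
    print(latents_shift(source, mask, strength=0.5))
```

In both functions the mask decides where source information is weakened, so tokens outside the edited region pass through untouched. A real implementation would operate on attention tensors and latent tensors inside the sampler rather than on Python lists.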
