ProEdit: Inversion-based Editing From Prompts Done Right

Abstract

Inversion-based visual editing provides an effective, training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during sampling to maintain editing consistency. However, this sampling strategy relies too heavily on source information, which harms the edits in the target image (e.g., failing to change the subject's attributes such as pose, number, or color as instructed). In this work, we propose ProEdit to address this issue at both the attention level and the latent level. At the attention level, we introduce KV-mix, which mixes the KV features of the source and the target in the edited region, mitigating the influence of the source image on that region while maintaining background consistency. At the latent level, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves state-of-the-art performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods such as RF-Solver, FireFlow, and UniEdit.
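
The abstract does not give the exact formulas for KV-mix or Latents-Shift, so the following is only an illustrative sketch of the two ideas, not the authors' implementation. It assumes a per-token edit mask (1.0 inside the edited region, 0.0 in the background), a linear blend between source and target features, and Gaussian noise as the perturbation. The function names and the `alpha`, `strength`, and `seed` parameters are hypothetical.

```python
import random


def blend(a, b, w):
    """Return (1 - w) * a + w * b element-wise for two equal-length vectors."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def kv_mix(src, tgt, edit_mask, alpha=0.8):
    """Mix per-token source and target attention features (keys or values).

    src, tgt:   lists of per-token feature vectors.
    edit_mask:  per-token weight, 1.0 in the edited region, 0.0 in the background.
    alpha:      hypothetical weight of the target features inside the edited region.

    Background tokens (mask 0.0) keep the source features unchanged, which is
    what preserves background consistency in this sketch.
    """
    return [blend(s, t, alpha * m) for s, t, m in zip(src, tgt, edit_mask)]


def latents_shift(z_src, edit_mask, strength=0.5, seed=0):
    """Perturb the source latent only inside the edited region.

    Assumption: the perturbation is a blend with standard Gaussian noise.
    The source only says that the edited region of the latent is perturbed.
    """
    rng = random.Random(seed)
    shifted = []
    for z, m in zip(z_src, edit_mask):
        noise = [rng.gauss(0.0, 1.0) for _ in z]
        shifted.append(blend(z, noise, strength * m))
    return shifted


# Two tokens: the first is background, the second lies in the edited region.
k_src = [[1.0, 2.0], [3.0, 4.0]]
k_tgt = [[10.0, 20.0], [30.0, 40.0]]
mask = [0.0, 1.0]

k_mixed = kv_mix(k_src, k_tgt, mask, alpha=0.5)
z_shifted = latents_shift(k_src, mask)
```

In this sketch the same `kv_mix` call would be applied to keys and values separately. The mask is what confines both operations to the edited region, so where that mask comes from (for example, an attention map or a user-provided region) is left open here.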
