ProEdit: Inversion-based Editing From Prompts Done Right

Abstract

Inversion-based visual editing provides an effective, training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy relies too heavily on source information, which harms the edits in the target image (e.g., failing to change subject attributes such as pose, number, or color as instructed). In this work, we propose ProEdit to address this issue at both the attention level and the latent level. At the attention level, we introduce KV-mix, which mixes the key-value (KV) features of the source and the target in the edited region, mitigating the influence of the source image on the edited region while maintaining background consistency. At the latent level, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow, and UniEdit.
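
The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names `kv_mix` and `latents_shift`, the linear blend, the `mix_ratio` and `strength` parameters, the Gaussian perturbation, and the flattened (tokens, dim) layout are hypothetical and are not taken from the authors' implementation. The sketch only shows the general idea of changing features inside an edit mask while leaving the background untouched.

```python
import numpy as np


def kv_mix(kv_src, kv_tgt, edit_mask, mix_ratio=0.5):
    """Blend source and target key (or value) features inside the edited region.

    kv_src, kv_tgt: arrays of shape (tokens, dim).
    edit_mask: array of shape (tokens,), 1 for edited tokens, 0 for background.
    mix_ratio: hypothetical weight given to the target features in the mask.
    """
    m = edit_mask.astype(kv_src.dtype)[:, None]
    # Assumed linear blend; the paper may use a different mixing rule.
    mixed = mix_ratio * kv_tgt + (1.0 - mix_ratio) * kv_src
    # Background tokens keep the source features unchanged.
    return m * mixed + (1.0 - m) * kv_src


def latents_shift(z_src, edit_mask, strength=0.5, rng=None):
    """Perturb the edited region of an inverted source latent.

    z_src: array of shape (tokens, dim), the inverted source latent.
    strength: hypothetical weight of the perturbation inside the mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = edit_mask.astype(z_src.dtype)[:, None]
    noise = rng.standard_normal(z_src.shape).astype(z_src.dtype)
    # Assumed perturbation: interpolate toward Gaussian noise in the mask.
    perturbed = (1.0 - strength) * z_src + strength * noise
    return m * perturbed + (1.0 - m) * z_src
```

In both functions the mask restricts the change to the edited tokens, which mirrors the abstract's stated goal of weakening the source influence in the edited region while keeping the background consistent.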