ProEdit: Inversion-based Editing From Prompts Done Right
December 26, 2025
Authors: Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng
cs.AI
Abstract
Inversion-based visual editing provides an effective, training-free way to edit an image or a video according to user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy relies too heavily on the source information, which negatively affects the intended edits in the target image (e.g., failing to change the subject's attributes such as pose, number, or color as instructed). In this work, we propose ProEdit to address this issue at both the attention and the latent level. At the attention level, we introduce KV-mix, which mixes the KV features of the source and the target in the edited region, mitigating the influence of the source image on that region while maintaining background consistency. At the latent level, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling process. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves state-of-the-art performance. Moreover, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods such as RF-Solver, FireFlow, and UniEdit.
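The abstract only describes the two components at a high level. The sketch below is a minimal, hypothetical PyTorch-style illustration of what region-masked KV mixing and a masked latent perturbation could look like; the function names, the binary mask `edit_mask`, the mixing weight `alpha`, and the noise scale `sigma` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def kv_mix(k_src, v_src, k_tgt, v_tgt, edit_mask, alpha=0.5):
    """Hypothetical KV-mix sketch.

    Inside the edited region, blend source and target key/value features;
    outside it, keep the source KV so the background stays consistent.
    edit_mask: (num_tokens, 1) binary tensor, 1 = edited region.
    """
    k_mixed = edit_mask * (alpha * k_tgt + (1 - alpha) * k_src) + (1 - edit_mask) * k_src
    v_mixed = edit_mask * (alpha * v_tgt + (1 - alpha) * v_src) + (1 - edit_mask) * v_src
    return k_mixed, v_mixed

def latents_shift(z_src_inv, edit_mask, sigma=0.1):
    """Hypothetical Latents-Shift sketch.

    Perturb only the edited region of the inverted source latent so that
    sampling in that region is no longer anchored to the source content.
    """
    noise = torch.randn_like(z_src_inv)
    return z_src_inv + sigma * edit_mask * noise
```

In a plug-and-play setting, sketches like these would stand in for the plain source-KV injection and the unmodified inverted latent inside an existing inversion-based pipeline such as RF-Solver, FireFlow, or UniEdit; the paper's actual mixing and perturbation rules may differ.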