LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
September 15, 2025
Authors: Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
cs.AI
Abstract
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference to strengthen attention control. This reliable reference enables a stable full-strength inversion process, a first for the drag-based editing task. It obviates the need for TTO and unlocks the model's generative capability. As a result, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening a dog's mouth and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance but also paves the way for new editing paradigms.
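
To make the core idea of the abstract concrete, here is a minimal, hedged sketch of how an explicit correspondence map derived from drag inputs could be used to bias attention toward matched tokens, instead of relying on implicit attention similarity. This is an illustrative assumption, not the authors' implementation; the function names `build_correspondence_map` and `biased_attention`, the token-grid construction, and the additive logit bias are all hypothetical choices.

```python
# Illustrative sketch only (not the paper's actual method): build an explicit
# token-level correspondence map from user drag handles/targets and use it to
# bias attention logits, so matched target tokens prefer their source tokens.
import torch
import torch.nn.functional as F


def build_correspondence_map(src_pts, dst_pts, latent_hw, patch=2):
    """Map each drag target location to its source location on the token grid.

    src_pts, dst_pts: (N, 2) pixel coordinates (x, y) in latent space.
    latent_hw: (H, W) latent size; tokens form an (H/patch, W/patch) grid.
    Returns {target_token_index: source_token_index}.
    """
    H, W = latent_hw
    gw = W // patch
    corr = {}
    for (sx, sy), (dx, dy) in zip(src_pts.tolist(), dst_pts.tolist()):
        s_idx = int(sy) // patch * gw + int(sx) // patch
        d_idx = int(dy) // patch * gw + int(dx) // patch
        corr[d_idx] = s_idx
    return corr


def biased_attention(q, k, v, corr, bias_strength=4.0):
    """Scaled dot-product attention with an explicit-correspondence logit bonus.

    q, k, v: (tokens, dim). Each (target, source) pair in `corr` adds a fixed
    bonus to the target query's logit for the matched source key.
    """
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5
    for tgt, src in corr.items():
        logits[tgt, src] += bias_strength
    return F.softmax(logits, dim=-1) @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    tokens, dim = (64 // 2) * (64 // 2), 8  # 32x32 token grid from a 64x64 latent
    q, k, v = (torch.randn(tokens, dim) for _ in range(3))
    corr = build_correspondence_map(
        src_pts=torch.tensor([[10.0, 12.0]]),  # drag handle (x, y)
        dst_pts=torch.tensor([[30.0, 12.0]]),  # drag target (x, y)
        latent_hw=(64, 64),
    )
    out = biased_attention(q, k, v, corr)
    print(out.shape)  # torch.Size([1024, 8])
```

The additive logit bias is one simple way to inject an explicit reference into attention; the actual attention-control mechanism, inversion schedule, and multi-round handling described in the abstract are detailed in the paper itself.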