
LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

September 15, 2025
作者: Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
cs.AI

Abstract

The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference to strengthen attention control. This reliable reference enables a stable full-strength inversion process, a first for drag-based editing. It obviates the need for TTO and unlocks the generative capability of the models. As a result, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening a dog's mouth and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance but also opens a new path for editing paradigms.