DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
October 2, 2025
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
cs.AI
Abstract
Drag-based image editing has long suffered from distortions in the target
region, largely because the priors of earlier base models (e.g., Stable Diffusion)
are insufficient to project optimized latents back onto the natural image
manifold. With the shift from UNet-based DDPMs to more scalable DiTs with flow
matching (e.g., SD3.5, FLUX), generative priors have become significantly
stronger, enabling advances across diverse editing tasks. However, drag-based
editing has yet to benefit from these stronger priors. This work proposes
DragFlow, the first framework to effectively harness FLUX's rich prior for
drag-based editing, achieving substantial gains over baselines. We first
show that directly applying point-based drag editing to DiTs performs poorly:
unlike the highly compressed features of UNets, DiT features are insufficiently
structured to provide reliable guidance for point-wise motion supervision. To
overcome this limitation, DragFlow introduces a region-based editing paradigm,
where affine transformations enable richer and more consistent feature
supervision. Additionally, we integrate pretrained open-domain personalization
adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving
background fidelity through gradient mask-based hard constraints. Multimodal
large language models (MLLMs) are further employed to resolve task ambiguities.
For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench)
featuring region-level dragging instructions. Extensive experiments on
DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and
region-based baselines, setting a new state-of-the-art in drag-based image
editing. Code and datasets will be publicly available upon publication.
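The region-based supervision described in the abstract can be pictured as warping the reference DiT features of the dragged region with an affine transform and matching them densely at the target location, rather than supervising isolated points. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; all names (`feat_ref`, `feat_cur`, `src_mask`, `theta`) are illustrative assumptions.

```python
# Minimal sketch (assumed names, not the paper's code) of region-based motion
# supervision: reference features inside the dragged region are warped by an
# affine transform and used as a dense target for the current feature map.
import torch
import torch.nn.functional as F

def region_motion_loss(feat_ref: torch.Tensor,   # (1, C, H, W) features of the original image
                       feat_cur: torch.Tensor,   # (1, C, H, W) features of the current optimized latent
                       src_mask: torch.Tensor,   # (1, 1, H, W) binary mask of the dragged region
                       theta: torch.Tensor       # (1, 2, 3) affine matrix (target coords -> source coords)
                       ) -> torch.Tensor:
    """Dense feature supervision over a whole region instead of sparse points."""
    # Sampling grid over the feature map; per PyTorch convention, theta maps
    # output (target) coordinates back to input (source) coordinates.
    grid = F.affine_grid(theta, feat_ref.shape, align_corners=False)

    # Warp the reference features and the region mask into the target location.
    warped_feat = F.grid_sample(feat_ref, grid, align_corners=False)
    warped_mask = F.grid_sample(src_mask.float(), grid, align_corners=False)

    # L1 supervision restricted to the warped region; the reference is detached
    # so only the current latent receives gradients.
    region = warped_mask > 0.5
    diff = (feat_cur - warped_feat.detach()).abs() * region
    return diff.sum() / region.sum().clamp(min=1)
```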
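Likewise, the gradient mask-based hard constraint for background fidelity can be sketched as zeroing the latent's gradient outside the editable region during optimization, so background content is never updated. This is only an illustrative sketch under assumed names (`latent`, `edit_mask`), not the paper's code.

```python
# Minimal sketch (assumed names) of a gradient-mask hard constraint: gradients
# on the optimized latent are zeroed outside the editable region, so the
# background latent is never modified during drag optimization.
import torch

latent = torch.randn(1, 16, 64, 64, requires_grad=True)   # latent being optimized
edit_mask = torch.zeros(1, 1, 64, 64)                      # 1 inside the draggable region
edit_mask[..., 16:48, 16:48] = 1.0

# Hard constraint: kill gradients outside the edit region before every update.
latent.register_hook(lambda grad: grad * edit_mask)

optimizer = torch.optim.Adam([latent], lr=1e-2)
for _ in range(10):                                        # stand-in for the motion-supervision loop
    loss = latent.pow(2).mean()                            # placeholder for the region-based drag loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # background positions stay fixed
```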