DragFlow: Unleashing DiT Priors with Region-Based Supervision for Drag Editing
October 2, 2025
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
cs.AI
Abstract
Drag-based image editing has long suffered from distortions in the target
region, largely because the priors of earlier base models such as Stable
Diffusion are insufficient to project optimized latents back onto the natural
image manifold. With the shift from UNet-based DDPMs to more scalable DiTs
with flow matching (e.g., SD3.5, FLUX), generative priors have become
significantly stronger, enabling advances across diverse editing tasks.
However, drag-based editing has yet to benefit from these stronger priors.
This work proposes DragFlow, the first framework to effectively harness
FLUX's rich prior for drag-based editing, achieving substantial gains over
baselines. We first
show that directly applying point-based drag editing to DiTs performs poorly:
unlike the highly compressed features of UNets, DiT features are insufficiently
structured to provide reliable guidance for point-wise motion supervision. To
overcome this limitation, DragFlow introduces a region-based editing paradigm,
where affine transformations enable richer and more consistent feature
supervision. Additionally, we integrate pretrained open-domain personalization
adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving
background fidelity through gradient mask-based hard constraints. Multimodal
large language models (MLLMs) are further employed to resolve task ambiguities.
For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench)
featuring region-level dragging instructions. Extensive experiments on
DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and
region-based baselines, setting a new state-of-the-art in drag-based image
editing. Code and datasets will be publicly available upon publication.
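
To make the region-based paradigm concrete, below is a minimal PyTorch sketch of region-level motion supervision: reference features and the user-selected region are warped by an affine transform, and every feature inside the moved region supervises the optimized latent's features, rather than a handful of points. This is an illustration under stated assumptions, not the authors' released code; the function name `region_motion_loss` and all tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def region_motion_loss(feat_cur, feat_ref, region_mask, theta):
    # feat_cur:    (1, C, H, W) DiT features of the latent being optimized
    # feat_ref:    (1, C, H, W) frozen features of the original image
    # region_mask: (1, 1, H, W) binary mask of the user-selected source region
    # theta:       (1, 2, 3) affine matrix; grid_sample "pulls", so theta maps
    #              target coordinates back to source coordinates
    grid = F.affine_grid(theta, feat_ref.shape, align_corners=False)
    # Warp the reference features and the mask to the dragged position.
    feat_warp = F.grid_sample(feat_ref, grid, align_corners=False)
    mask_warp = (F.grid_sample(region_mask.float(), grid,
                               align_corners=False) > 0.5).float()
    # Average the feature discrepancy over the whole moved region.
    num = mask_warp.sum().clamp(min=1.0)
    return ((feat_cur - feat_warp.detach()).abs() * mask_warp).sum() / num
```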
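The gradient mask-based hard constraint can likewise be read as masking the latent's gradient so the background never receives an update, which is stricter than a soft reconstruction penalty that merely discourages drift. A sketch under the same assumptions; the `extract_feat` callable, the `drag_optimize` signature, and all hyperparameters are hypothetical:

```python
import torch

def drag_optimize(latent_init, edit_mask, extract_feat, feat_ref,
                  region_mask, theta, steps=80, lr=1e-2):
    # latent_init:  (1, C, h, w) inverted latent of the source image
    # edit_mask:    (1, 1, h, w) binary mask, 1 inside the editable region
    # extract_feat: callable returning DiT features for a latent (hypothetical)
    latent = latent_init.detach().clone().requires_grad_(True)
    # Hard constraint: zero gradients over the background so the optimizer
    # can never change it, keeping background latents bitwise identical.
    latent.register_hook(lambda g: g * edit_mask)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = region_motion_loss(extract_feat(latent), feat_ref,
                                  region_mask, theta)
        loss.backward()  # the hook masks the gradient before the step
        opt.step()
    return latent.detach()
```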