DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Abstract

Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models such as Stable Diffusion are too weak to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiTs trained with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes DragFlow, the first framework to effectively harness FLUX's rich prior for drag-based editing, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, in which affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state of the art in drag-based image editing. Code and datasets will be publicly available upon publication.
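Two ideas named in the abstract, moving a whole region with an affine transformation and protecting the background with a gradient mask, can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: it assumes the region is a binary mask on a small grid, that the drag is a single affine map, that the editable area is the union of the source and target regions, and that the "latent" is a plain 2D array. The names `affine_warp_mask` and `masked_update` and all parameters are hypothetical.

```python
def affine_warp_mask(mask, A, t):
    """Warp a binary region mask with the affine map p -> A p + t.

    Hypothetical helper. `A` is a 2x2 matrix ((a, b), (c, d)) acting on
    (x, y) and `t` is a translation (tx, ty). Inverse mapping with
    nearest-neighbour lookup keeps the warped region free of holes.
    """
    h, w = len(mask), len(mask[0])
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix must be invertible")
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Find the source pixel that the affine map sends to (x, y).
            dx, dy = x - t[0], y - t[1]
            sx = round((d * dx - b * dy) / det)
            sy = round((-c * dx + a * dy) / det)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = mask[sy][sx]
    return out


def masked_update(latent, grad, edit_mask, lr=0.1):
    """One gradient step that leaves entries outside `edit_mask` unchanged."""
    return [
        [z - lr * g * m for z, g, m in zip(z_row, g_row, m_row)]
        for z_row, g_row, m_row in zip(latent, grad, edit_mask)
    ]


# Source region: a 2x2 block in the top-left corner of a 4x4 grid.
src = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Drag the region two pixels to the right (identity matrix, pure translation).
dst = affine_warp_mask(src, ((1, 0), (0, 1)), (2, 0))

# Assumption: only the union of source and target regions may change.
edit_mask = [[max(s, d) for s, d in zip(s_row, d_row)]
             for s_row, d_row in zip(src, dst)]

latent = [[0.0] * 4 for _ in range(4)]
grad = [[1.0] * 4 for _ in range(4)]  # stand-in for a real loss gradient
updated = masked_update(latent, grad, edit_mask, lr=0.5)
```

In this toy setup the top two rows of `updated` move while the bottom two rows stay exactly at their original values, which is the sense in which a gradient mask acts as a hard constraint on the background.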
