Stable Flow: Vital Layers for Training-Free Image Editing

Abstract

Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT) and adopted flow matching for improved training and sampling. However, these models exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, so it is unclear in which layers the injection should be performed. We therefore propose an automatic method to identify "vital layers" within DiT, the layers that are crucial for image formation, and demonstrate how these layers enable a range of controlled, stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow
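
The abstract does not say how the "vital layers" are found, so the following is only a minimal sketch of one plausible reading: bypass each layer in turn and measure how far the output drifts from the unmodified result. Everything in it is an assumption made for illustration and is not taken from the paper: the toy residual stack stands in for a DiT, cosine distance stands in for whatever similarity measure the authors use, and the names `make_layer`, `run_stack`, and `rank_layers_by_vitality` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(gain: float, direction: Vector) -> Layer:
    """Toy layer (assumption): adds a fixed update to the residual stream."""
    def layer(h: Vector) -> Vector:
        # The update ignores h on purpose so the example stays easy to follow.
        return [gain * d for d in direction]
    return layer


def run_stack(layers: Sequence[Layer], x: Vector, skip: int = -1) -> Vector:
    """Apply residual layers in order; the layer at index `skip` is bypassed."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # bypass: the residual stream passes through unchanged
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; 0 means the two outputs point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_layers_by_vitality(
    layers: Sequence[Layer], x: Vector
) -> List[Tuple[int, float]]:
    """Return (layer index, deviation) pairs, largest deviation first."""
    reference = run_stack(layers, x)
    scores = []
    for i in range(len(layers)):
        ablated = run_stack(layers, x, skip=i)
        scores.append((i, cosine_distance(reference, ablated)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Four toy layers: indices 1 and 3 make large updates, 0 and 2 barely matter.
toy_layers = [
    make_layer(0.01, [0.0, 1.0, 0.0]),
    make_layer(2.0, [0.0, 1.0, 0.0]),
    make_layer(0.02, [0.0, 0.0, 1.0]),
    make_layer(1.5, [0.0, 0.0, 1.0]),
]
x0 = [1.0, 0.0, 0.0]

ranking = rank_layers_by_vitality(toy_layers, x0)
for index, score in ranking:
    print(f"layer {index}: deviation {score:.4f}")
```

In this toy setup, layers 1 and 3 rank first because removing them changes the output the most. In an editing pipeline of the kind the abstract describes, the highest-ranked layers would be the candidates for attention-feature injection, though how the authors actually select and use them is not stated here.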
