RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Abstract

Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract content-aware editing intent and transfer it to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT)-based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
