OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Abstract

Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. First, existing models have limited editing skills because of the biased synthesis process. Second, these methods are trained on datasets with a high volume of noise and artifacts, a consequence of simple filtering methods such as CLIP-score. Third, all these datasets are restricted to a single low resolution and a fixed aspect ratio, which limits their versatility in real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that seamlessly handles seven different image editing tasks at any aspect ratio. Our contribution is fourfold: (1) OmniEdit is trained with supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (such as GPT-4o), instead of CLIP-score, to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; and (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions that cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset, and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
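
The abstract does not spell out how the score-based sampling in contribution (2) works, so the following is only a minimal sketch of one plausible reading: each candidate editing pair carries a quality score (assumed here to be a non-negative number, for example a 0 to 10 rating from a multimodal model), and a training subset is drawn without replacement with probability weighted by that score. The function name `sample_by_score`, the score scale, and the choice of the Efraimidis-Spirakis weighted sampling scheme are all assumptions made for illustration, not details taken from the paper.

```python
import random


def sample_by_score(pairs, scores, k, seed=0):
    """Draw up to k distinct pairs, favoring higher quality scores.

    Hypothetical helper: uses Efraimidis-Spirakis weighted sampling
    without replacement. Each item gets the key u ** (1 / score) with
    u uniform in [0, 1), and the k largest keys are kept.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    rng = random.Random(seed)
    keyed = []
    for pair, score in zip(pairs, scores):
        if score <= 0:
            continue  # assumption: non-positive scores are never sampled
        keyed.append((rng.random() ** (1.0 / score), pair))
    # Largest keys first; higher scores push keys toward 1.
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in keyed[:k]]


if __name__ == "__main__":
    # Placeholder identifiers and scores, not real data.
    candidates = ["pair_000", "pair_001", "pair_002", "pair_003"]
    quality = [2.0, 9.0, 10.0, 0.0]
    print(sample_by_score(candidates, quality, k=2))
```

Sampling by weight, rather than applying a hard threshold, keeps some lower-scored pairs in the mix; a strict cutoff on the score would be an equally reasonable reading of the abstract.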
