RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Abstract

Despite recent advances in inversion-based and instruction-based image editing, existing approaches excel mainly at editing a single, prominent object and struggle significantly on complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, on which even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit, an instruction-based editing model trained on data from our scalable synthetic data generation pipeline. Trained on only 20,000 editing triplets, RefEdit outperforms Flux/SD3 model-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels at referring expression tasks but also improves performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release the data and checkpoints for reproducibility.
