
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

June 3, 2025
Authors: Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral
cs.AI

Abstract

Despite recent advances in inversion- and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, on which even baselines trained on millions of samples perform poorly. To overcome this limitation, we propose RefEdit, an instruction-based editing model trained on data from our scalable synthetic data generation pipeline. Trained on only 20,000 editing triplets, RefEdit outperforms Flux/SD3-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels at referring-expression tasks but also improves performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data and checkpoints for reproducibility.
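
For readers unfamiliar with the format, an instruction-based "editing triplet" typically pairs a source image, a natural-language edit instruction, and the ground-truth edited image. The sketch below is a minimal illustration of that structure, assuming a RefCOCO-style referring expression in the instruction; the field names and example values are hypothetical, not the released RefEdit schema.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """One instruction-based editing example (hypothetical schema,
    not the released RefEdit data format)."""
    source_image: str   # path to the original image
    instruction: str    # edit instruction containing a referring expression
    edited_image: str   # path to the ground-truth edited image

# Illustrative example: the referring expression ("the man on the left")
# must disambiguate between multiple entities in the same scene,
# which is exactly the case the abstract says existing editors handle poorly.
example = EditTriplet(
    source_image="scene_two_people.jpg",
    instruction="Change the jacket of the man on the left to red",
    edited_image="scene_two_people_edited.jpg",
)
print(example.instruction)
```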