RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

June 3, 2025
Authors: Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral
cs.AI

Abstract

Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, on which even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit -- an instruction-based editing model trained on data from our scalable synthetic data generation pipeline. RefEdit, trained on only 20,000 editing triplets, outperforms Flux/SD3-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels at referring-expression tasks but also improves performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data and checkpoints for reproducibility.
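To make the notion of an "editing triplet" concrete, here is a minimal sketch in Python of how such a training example might be represented: a source image, a referring-expression edit instruction, and the edited target image. The class, field names, and helper below are hypothetical illustrations, not the authors' released data schema.

```python
from dataclasses import dataclass
from PIL import Image


@dataclass
class EditingTriplet:
    """One training example for referring-expression image editing
    (hypothetical representation, not the paper's released format)."""
    source: Image.Image   # original scene, possibly containing many entities
    instruction: str      # e.g. "put a red hat on the cat on the left"
    target: Image.Image   # same scene with only the referred entity edited


def load_triplet(src_path: str, instruction: str, tgt_path: str) -> EditingTriplet:
    # Hypothetical loader; paths and conventions are illustrative only.
    return EditingTriplet(
        source=Image.open(src_path).convert("RGB"),
        instruction=instruction,
        target=Image.open(tgt_path).convert("RGB"),
    )
```

Under this framing, the benchmark's difficulty comes from the instruction: it must be grounded to one specific entity among several similar ones, which is where the paper reports single-object editing baselines break down.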