RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Abstract

Despite recent advances in inversion-based and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, on which even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit, an instruction-based editing model trained on our scalable synthetic data generation pipeline. Trained on only 20,000 editing triplets, RefEdit outperforms the Flux/SD3 model-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release the data and checkpoint for reproducibility.
