Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Abstract

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To build the benchmark, we use GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them into cohesive, complex instructions. We also introduce a suite of metrics that assess different aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessment. The benchmark yields several notable insights: 1) open-source models significantly underperform proprietary, closed-source models, and the performance gap widens as instruction complexity increases; 2) increased instruction complexity primarily impairs a model's ability to retain key elements of the input image and to preserve overall aesthetic quality; 3) decomposing a complex instruction into a sequence of atomic steps and executing them one by one substantially degrades performance across multiple metrics; 4) a straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) we observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also appears in the latest GPT-4o outputs.
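The Best-of-N selection mentioned in insight 4 can be illustrated with a minimal sketch. This is not the authors' implementation: `edit_fn` and `score_fn` are hypothetical callables standing in for an image editing model and a VLM-based scorer, and the seeding scheme is an assumption made only to keep the example reproducible.

```python
import random
from typing import Any, Callable, Tuple


def best_of_n(
    edit_fn: Callable[[Any, str, int], Any],     # hypothetical: (image, instruction, seed) -> edited image
    score_fn: Callable[[Any, Any, str], float],  # hypothetical: (source, edited, instruction) -> score
    image: Any,
    instruction: str,
    n: int = 4,
    seed: int = 0,
) -> Tuple[Any, float]:
    """Sample n candidate edits and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    best_image, best_score = None, float("-inf")
    for _ in range(n):
        # Each candidate gets its own sampling seed so the edits differ.
        candidate = edit_fn(image, instruction, rng.randrange(2**31))
        score = score_fn(image, candidate, instruction)
        if score > best_score:
            best_image, best_score = candidate, score
    return best_image, best_score


# Toy stand-ins: the "edited image" is just a digit derived from the seed,
# and the score is that digit, so the best candidate is the largest digit.
toy_edit = lambda img, instr, s: s % 10
toy_score = lambda src, out, instr: float(out)
result = best_of_n(toy_edit, toy_score, image="input", instruction="make it snowy", n=4, seed=0)
```

For the step-by-step sequential setting, the same function could be applied at each atomic step, with the selected output passed on as the next step's input. Whether selection happens per step or once at the end is a design choice this sketch leaves open.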
