PaintBench: Deterministic Evaluation of Precise Visual Editing

Abstract

Current multimodal models are proficient at open-ended visual editing, but executing precise edits that admit a single correct answer remains a significant obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity yields an effectively unlimited, contamination-resistant evaluation suite, and deterministic pixel-level evaluation removes the reliance on bias-prone judge models. Across 11 image editing models, we find low performance overall: the best-performing industry-leading model reaches only 17.1% mIoU. Breaking results down by task reveals especially challenging operation types (geometric transformation, most structural manipulation, and formula-based color change) as well as model-specific specializations. Fine-grained benchmark diagnostics further show that performance degrades under scene variations in object count, background complexity, color scheme, and edit-region size. To test whether PaintBench scores generalize to applied task performance, we build a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find a strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
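To make the mIoU figure concrete, the following is a minimal sketch of a deterministic, pixel-level intersection-over-union score averaged over examples. The abstract does not specify how PaintBench derives the masks being compared, so the function names (mask_iou, mean_iou), the use of boolean masks, and the convention that two empty masks score 1.0 are all assumptions made for illustration, not the benchmark's actual implementation.

```python
import numpy as np


def mask_iou(pred, target):
    """IoU between two boolean pixel masks of identical shape (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


def mean_iou(pairs):
    """Average IoU over an iterable of (pred, target) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("at least one (pred, target) pair is required")
    return float(np.mean(scores))
```

Because the score depends only on the pixel values of the two masks, repeated runs on the same outputs give identical results, which is the property the abstract contrasts with judge-model scoring.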