Editor's Pick: Evaluating Abstract Intent in Image Editing via Atomic Entity Analysis

Abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.
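
To make the entity-level idea concrete, here is a minimal sketch of how per-entity judgments might be aggregated into an intent score and a preservation score. It is an illustration under stated assumptions, not the Entity-Rubrics implementation: the `EntityJudgment` fields, the `score_edit` and `diagnose` helpers, the averaging rule, and the 0.75 threshold are all hypothetical, and the per-entity judgments are assumed to come from some external judge (for example a human rater or a vision-language model).

```python
from dataclasses import dataclass


@dataclass
class EntityJudgment:
    """One rubric item for a single entity in the scene (hypothetical schema)."""
    name: str
    should_change: bool  # does the abstract instruction imply this entity changes?
    changed: bool        # did the edited image actually change it?


def score_edit(judgments):
    """Aggregate per-entity judgments into intent and preservation scores in [0, 1]."""
    targets = [j for j in judgments if j.should_change]
    keepers = [j for j in judgments if not j.should_change]
    # Intent: fraction of entities that should change and did change.
    intent = sum(j.changed for j in targets) / len(targets) if targets else 1.0
    # Preservation: fraction of entities that should stay and were left alone.
    preservation = (
        sum(not j.changed for j in keepers) / len(keepers) if keepers else 1.0
    )
    return {"intent": intent, "preservation": preservation}


def diagnose(scores, threshold=0.75):
    """Label the failure modes named in the abstract (threshold is an assumption)."""
    labels = []
    if scores["intent"] < threshold:
        labels.append("under-editing")
    if scores["preservation"] < threshold:
        labels.append("over-editing")
    return labels


# Example: an instruction such as "make the room feel cozier" (illustrative only).
judgments = [
    EntityJudgment("lighting", should_change=True, changed=True),
    EntityJudgment("lamp", should_change=True, changed=False),
    EntityJudgment("person", should_change=False, changed=False),
    EntityJudgment("sofa", should_change=False, changed=False),
]
scores = score_edit(judgments)
print(scores, diagnose(scores))
```

Keeping the two scores separate, rather than collapsing them into one number, mirrors the trade-off described above: a model can fail by changing too little (low intent) or by changing too much (low preservation), and a per-entity record shows which entities were responsible.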