ImagenHub: Standardizing the evaluation of conditional image generation models

Abstract

Recently, a wide range of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, and control-guided image generation. However, we observe large inconsistencies in experimental conditions (datasets, inference, and evaluation metrics) that make fair comparisons difficult. This paper proposes ImagenHub, a one-stop library to standardize the inference and evaluation of all conditional image generation models. First, we define seven prominent tasks and curate high-quality evaluation datasets for them. Second, we build a unified inference pipeline to ensure fair comparison. Third, we design two human evaluation scores, Semantic Consistency and Perceptual Quality, along with comprehensive guidelines for evaluating generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves high inter-rater agreement, with Krippendorff's alpha above 0.4 for 76% of the models. We comprehensively evaluate around 30 models in total and observe three key takeaways: (1) The performance of existing models is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% of the models achieving an overall score lower than 0.5. (2) We examine the claims from published papers and find that 83% of them hold, with a few exceptions. (3) Except for subject-driven image generation, none of the existing automatic metrics has a Spearman's correlation higher than 0.2. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.
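To make takeaway (3) concrete, the following is a minimal sketch of how a Spearman's rank correlation could be computed, assuming the correlation is taken between an automatic metric's scores and human ratings for the same outputs. The function names and the score values are hypothetical and purely illustrative; they are not part of ImagenHub and are not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one human score and one automatic metric score
# per generated image (illustrative values only).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
metric_scores = [0.31, 0.35, 0.28, 0.30, 0.33]

rho = spearman(human_scores, metric_scores)
print(f"Spearman's rho = {rho:.2f}")
```

A value near 1 would mean the automatic metric ranks outputs almost the same way the human raters do, while a value near 0 or below would mean the metric's ranking carries little or no agreement with the human ranking.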
