SelfEval: Leveraging the discriminative nature of generative models for evaluation

Abstract

In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner, assessing their performance on attribute binding, color recognition, counting, shape recognition, and spatial understanding. To the best of our knowledge, SelfEval is the first automated metric to show a high degree of agreement with gold-standard human evaluations when measuring text faithfulness across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as Winoground image-score, where they demonstrate performance competitive with discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score for measuring text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
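As a rough illustration of how likelihoods of a real image under several text prompts can drive a discriminative decision, the sketch below applies Bayes' rule with a uniform prior over candidate prompts. It is not the estimator used by SelfEval: the abstract does not specify how the likelihoods are obtained, so the function names and the example log-likelihood values here are hypothetical and serve only to show the final selection step.

```python
import math


def posterior_over_prompts(log_likelihoods):
    """Convert log p(image | prompt_i) into p(prompt_i | image).

    Assumes a uniform prior over the candidate prompts, so the posterior
    is a softmax over the log-likelihoods.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_likelihoods)
    weights = [math.exp(v - m) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def pick_prompt(log_likelihoods):
    """Return the index of the prompt with the highest posterior."""
    probs = posterior_over_prompts(log_likelihoods)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical log-likelihoods of one real image under three candidate prompts.
example_scores = [-1050.0, -1042.5, -1047.0]
probs = posterior_over_prompts(example_scores)
best = pick_prompt(example_scores)
```

Under these assumptions, the prompt with the largest likelihood is selected, and accuracy on a benchmark is the fraction of images for which that prompt matches the ground-truth caption.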
