WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Abstract

Text-to-image (T2I) models can generate high-quality artistic creations and visual content. However, existing research and evaluation standards focus predominantly on image realism and shallow text-image alignment, and lack a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this gap, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. We comprehensively test 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts. Our findings reveal significant limitations in their ability to integrate and apply world knowledge during image generation, and highlight critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.