

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

March 10, 2025
Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan
cs.AI

Abstract

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on the 1,000 structured prompts spanning 25 sub-domains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
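As a rough illustration of the gap the abstract describes, the sketch below contrasts a shallow word-pixel alignment score (CLIP text-image cosine similarity) with a WiScore-style weighted aggregate of judge ratings. The checkpoint name, the 0-2 rating scale, and the 0.7/0.2/0.1 weights are illustrative assumptions, not the paper's confirmed protocol; see the linked repository for the official implementation.

```python
# Minimal sketch: shallow CLIP alignment vs. a knowledge-informed aggregate.
# Model checkpoint, rating scale, and weights are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(prompt: str, image: Image.Image) -> float:
    """Shallow text-image alignment: cosine similarity of CLIP embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((text_emb @ img_emb.T).item())

def wiscore_like(consistency: float, realism: float, aesthetics: float) -> float:
    """Hypothetical WiScore-style aggregate: weighted sum of judge ratings
    (each assumed in [0, 2]), normalized to [0, 1]. Weights are assumptions."""
    raw = 0.7 * consistency + 0.2 * realism + 0.1 * aesthetics
    return raw / 2.0
```

The point of the contrast: a prompt requiring world knowledge (e.g., a plant wilting after days without water) can earn a high CLIP similarity from any plant image, whereas a judge-rated consistency score penalizes the missing knowledge-dependent detail before it is aggregated into the final metric.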

