

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

October 21, 2025
作者: Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI

Abstract

Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their textual prompts. However, (1) existing benchmarks lack diverse prompt scenarios and multilingual support, both essential for real-world applicability; and (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary evaluation criteria and 27 sub-criteria, with each prompt assessing multiple test points. To rigorously assess model robustness to variations in language and prompt length, we provide English and Chinese versions of each prompt in both short and long forms. Leveraging the broad world knowledge and fine-grained image understanding of a closed-source multi-modal large language model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
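Although the abstract does not specify the dataset schema or evaluation protocol, the hierarchy it describes (themes → subthemes, primary criteria → sub-criteria, per-prompt test points, and English/Chinese × short/long prompt variants) can be made concrete with a small sketch. Everything below is an illustrative assumption, not the authors' actual implementation: the class and field names (`TestPoint`, `BenchPrompt`), the example theme and criterion strings, and the `judge` callable standing in for Gemini-2.5-Pro or the authors' offline evaluation model.

```python
# Minimal sketch of a UniGenBench++-style benchmark entry and per-test-point
# judging loop. All names and the pass/fail protocol are hypothetical.
from dataclasses import dataclass, field


@dataclass
class TestPoint:
    primary_criterion: str   # one of the 10 primary criteria (name assumed)
    sub_criterion: str       # one of the 27 sub-criteria (name assumed)
    check: str               # natural-language check posed to the MLLM judge


@dataclass
class BenchPrompt:
    prompt_id: str
    theme: str               # one of 5 main prompt themes
    subtheme: str            # one of 20 subthemes
    # Each prompt ships in en/zh and short/long variants, keyed by tuple.
    text: dict = field(default_factory=dict)       # (lang, length) -> prompt
    testpoints: list = field(default_factory=list)


def evaluate(entry: BenchPrompt, image_path: str, judge) -> dict:
    """Score one generated image: the judge returns pass/fail per test point.

    `judge` wraps either a closed-source MLLM (e.g., Gemini-2.5-Pro) or the
    offline evaluation model; its interface here is an assumption.
    """
    results = {}
    for tp in entry.testpoints:
        verdict = judge(image_path, tp.check)  # -> True / False
        results[(tp.primary_criterion, tp.sub_criterion)] = verdict
    return results


if __name__ == "__main__":
    entry = BenchPrompt(
        prompt_id="0001",
        theme="daily life",            # hypothetical theme/subtheme labels
        subtheme="kitchen scenes",
        text={("en", "short"): "A red kettle on a round wooden table."},
        testpoints=[
            TestPoint("attribute", "color", "Is the kettle red?"),
            TestPoint("relation", "spatial", "Is the kettle on the table?"),
        ],
    )
    dummy_judge = lambda img, q: True  # stand-in for an actual MLLM call
    print(evaluate(entry, "sample.png", dummy_judge))
```

Under this reading, a model's score on any sub-criterion would simply be its pass rate over all test points tagged with that sub-criterion, which matches the abstract's claim that each prompt assesses multiple test points across the criterion hierarchy.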