UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

October 21, 2025
作者: Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI

Abstract

Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their textual prompts. However, existing benchmarks (1) lack diverse prompt scenarios and multilingual support, both essential for real-world applicability, and (2) offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 sub-themes; (2) it comprehensively probes T2I models' semantic consistency against 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide English and Chinese versions of each prompt, in both short and long forms. Leveraging the general world knowledge and fine-grained image understanding of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
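The abstract describes a hierarchical benchmark organization: each prompt belongs to one of 5 themes and 20 sub-themes, carries four language/length variants, and is annotated with multiple testpoints drawn from 10 primary and 27 sub evaluation criteria. The sketch below shows one plausible way to represent such an entry in Python; the class and field names (`Testpoint`, `PromptEntry`, the variant keys) are illustrative assumptions, not the authors' released data format.

```python
from dataclasses import dataclass, field

@dataclass
class Testpoint:
    criterion: str      # one of the 10 primary evaluation criteria
    sub_criterion: str  # one of the 27 sub-criteria
    question: str       # yes/no question the judge checks against the image

@dataclass
class PromptEntry:
    theme: str      # one of the 5 main prompt themes
    subtheme: str   # one of the 20 sub-themes
    # Four variants per prompt, e.g. keys "en_short", "en_long", "zh_short", "zh_long"
    prompts: dict[str, str] = field(default_factory=dict)
    # Each prompt assesses multiple testpoints
    testpoints: list[Testpoint] = field(default_factory=list)
```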
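The evaluation pipeline pairs each generated image with its prompt's testpoints and asks an MLLM judge (Gemini-2.5-Pro in the paper) to verify each one. The following is a minimal sketch of that kind of MLLM-as-judge loop under stated assumptions: `generate_image` and `ask_mllm_judge` are hypothetical hooks standing in for a T2I model and an MLLM API, and the per-criterion pass-rate aggregation is our simplification, not necessarily the paper's exact metric.

```python
def evaluate_model(entries, generate_image, ask_mllm_judge, variant="en_long"):
    """Return per-criterion pass rates over all testpoints.

    entries:         iterable of PromptEntry (see schema sketch above)
    generate_image:  callable prompt -> image, wrapping the T2I model under test
    ask_mllm_judge:  callable (image, question) -> bool, wrapping the MLLM judge
    """
    passed, total = {}, {}
    for entry in entries:
        image = generate_image(entry.prompts[variant])
        for tp in entry.testpoints:
            total[tp.criterion] = total.get(tp.criterion, 0) + 1
            if ask_mllm_judge(image, tp.question):
                passed[tp.criterion] = passed.get(tp.criterion, 0) + 1
    return {c: passed.get(c, 0) / total[c] for c in total}
```

Scoring at the testpoint level, rather than one holistic score per image, is what allows the benchmark to report fine-grained strengths and weaknesses per criterion; the paper's released evaluation model would slot into the `ask_mllm_judge` hook for offline use.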