OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
June 9, 2025
Authors: Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
cs.AI
Abstract
Text-to-image (T2I) models have garnered significant attention for generating
high-quality images aligned with text prompts. However, the rapid advancement of
T2I models has exposed the limitations of early benchmarks, which lack
comprehensive evaluation of, for example, reasoning, text rendering, and style.
Notably, recent state-of-the-art models, with their rich knowledge-modeling
capabilities, show promising results on image generation problems that require
strong reasoning ability, yet existing evaluation systems have not adequately
addressed this frontier. To systematically address these gaps, we
introduce OneIG-Bench, a meticulously designed comprehensive benchmark
framework for fine-grained evaluation of T2I models across multiple dimensions,
including prompt-image alignment, text rendering precision, reasoning-generated
content, stylization, and diversity. By structuring the evaluation, this
benchmark enables in-depth analysis of model performance, helping researchers
and practitioners pinpoint strengths and bottlenecks in the full pipeline of
image generation. Specifically, OneIG-Bench enables flexible evaluation by
allowing users to focus on a particular evaluation subset. Instead of
generating images for the entire set of prompts, users can generate images only
for the prompts associated with the selected dimension and then run the
corresponding evaluation. Our codebase and dataset are now publicly
available to facilitate reproducible evaluation studies and cross-model
comparisons within the T2I research community.
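
As a minimal sketch of the dimension-specific evaluation workflow described above: the prompt file layout (a CSV with `prompt` and `dimension` columns), the function names, and the user-supplied `generate` and `score` callables are all illustrative assumptions, not the actual OneIG-Bench API or data format.

```python
import csv

def load_prompts(path, dimension):
    # Hypothetical prompt file: one row per prompt, tagged with the evaluation
    # dimension it belongs to (e.g. "alignment", "text", "reasoning", "style",
    # "diversity"). Column names are assumptions for illustration only.
    with open(path, newline="", encoding="utf-8") as f:
        return [row["prompt"] for row in csv.DictReader(f)
                if row["dimension"] == dimension]

def evaluate_subset(path, dimension, generate, score):
    """Generate and score images only for the selected dimension.

    `generate` maps a prompt to an image; `score` maps (prompt, image) to a
    metric value. Both are stand-ins for a real T2I model and a
    dimension-specific metric supplied by the user.
    """
    prompts = load_prompts(path, dimension)
    results = [score(p, generate(p)) for p in prompts]
    return sum(results) / len(results) if results else float("nan")
```

In this sketch, only the prompts tagged with the chosen dimension are ever passed to the generator, which mirrors the benchmark's stated goal of letting users skip image generation for dimensions they do not intend to evaluate.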