
ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

March 29, 2026
Authors: Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen
cs.AI

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
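The abstract reports Kendall accuracies without defining the term. One common reading of such a figure is pairwise ranking accuracy: the fraction of item pairs that an automated metric orders the same way as human raters. The sketch below illustrates that reading only. The function name, the tie handling, and the sample scores are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations


def pairwise_ranking_accuracy(human_scores, metric_scores):
    """Fraction of item pairs the metric orders the same way as humans.

    Assumption (not from the paper): pairs tied in the human scores are
    skipped, and a metric tie on a pair that humans distinguish is a miss.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        metric_diff = metric_scores[i] - metric_scores[j]
        if human_diff == 0:
            continue  # humans express no preference for this pair
        total += 1
        if human_diff * metric_diff > 0:
            agree += 1  # both rankings prefer the same item

    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


# Hypothetical scores for four outputs: the metric swaps one pair (items 1 and 2).
human = [4, 3, 2, 1]
metric = [0.9, 0.7, 0.8, 0.1]
accuracy = pairwise_ranking_accuracy(human, metric)
```

Under this definition, and when no scores are tied, the accuracy relates to Kendall's tau by accuracy = (tau + 1) / 2, so a value of 0.79 would correspond to a tau of roughly 0.58.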
April 1, 2026