ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, image editing, and reference-guided composition. Yet existing benchmarks remain limited: they focus on isolated tasks, cover only narrow domains, or provide opaque scores that do not explain failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more with editing tasks than with generation tasks, especially local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies of up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
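The abstract does not define "Kendall accuracy". One common reading, assumed here and not confirmed by the source, is the fraction of item pairs that an automated metric orders the same way as the human scores. The sketch below illustrates that reading only; the function name, the handling of ties, and the example scores are hypothetical and are not taken from ImagenWorld.

```python
from itertools import combinations


def pairwise_ranking_accuracy(human_scores, metric_scores):
    """Fraction of item pairs ordered the same way by both score lists.

    Assumption: pairs that are tied in the human scores are skipped,
    and a pair tied only in the metric scores counts as a disagreement.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = metric_scores[i] - metric_scores[j]
        if h == 0:
            continue  # humans express no preference for this pair
        total += 1
        if h * m > 0:  # same sign: both prefer the same item
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


# Hypothetical scores for four generated images.
human = [4.0, 3.0, 2.0, 1.0]
metric = [0.9, 0.7, 0.8, 0.1]
accuracy = pairwise_ranking_accuracy(human, metric)
print(f"{accuracy:.3f}")
```

In this toy example the metric swaps only the second and third images, so five of the six pairs agree with the human ordering. Under this reading, a value of 0.79 would mean that roughly four out of five comparable pairs are ordered as humans order them.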