ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
