WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Abstract

Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark that systematically evaluates T2I models' world-knowledge grounding and implicit inferential capabilities across both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that, while diffusion models lead among open-source methods, proprietary autoregressive models such as GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: https://dwanzhang-ai.github.io/WorldGenBench/
