WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Abstract

Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities across both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary autoregressive models such as GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project page: https://dwanzhang-ai.github.io/WorldGenBench/