
GenExam: A Multidisciplinary Text-to-Image Exam

September 17, 2025
作者: Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
cs.AI

Abstract

Exams are a fundamental test of expert-level intelligence, requiring integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, while current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models score almost 0%, underscoring the difficulty of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of a model's ability to integrate knowledge, reasoning, and generation, providing insights on the path to artificial general intelligence (AGI).
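The abstract reports "strict scores," which suggests an all-or-nothing aggregation over each problem's fine-grained scoring points. The following is a minimal sketch of such a rule, assuming a sample earns credit only when every scoring point is judged correct; the data layout and rule are illustrative assumptions, not GenExam's documented protocol.

```python
# Hypothetical strict-scoring sketch: each sample carries a list of
# per-scoring-point judge verdicts; a sample counts only if ALL of its
# scoring points pass. These structures are assumptions for illustration.

def strict_score(samples: list[list[bool]]) -> float:
    """Fraction of samples whose scoring points are all satisfied."""
    if not samples:
        return 0.0
    passed = sum(1 for points in samples if all(points))
    return passed / len(samples)

# Example: three problems, each with fine-grained scoring-point verdicts.
judgments = [
    [True, True, True],   # every point correct -> full credit
    [True, False, True],  # one point missed -> zero under strict scoring
    [False, False],       # no credit
]
print(strict_score(judgments))
```

Under this rule a single missed scoring point zeroes out the whole problem, which is consistent with most models landing near 0% on a demanding benchmark.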