
GenExam: A Multidisciplinary Text-to-Image Exam

September 17, 2025
Authors: Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
cs.AI

Abstract

Exams are a fundamental test of expert-level intelligence, requiring integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, while current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with a ground-truth image and fine-grained scoring points to enable precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models score almost 0%, underscoring the difficulty of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path toward AGI.