GenExam: A Multidisciplinary Text-to-Image Exam

Abstract

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks focus mainly on understanding and reasoning tasks, while current generation benchmarks emphasize the illustration of world knowledge and visual concepts; neither evaluates rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem comes with ground-truth images and fine-grained scoring points, enabling precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models score almost 0%, which indicates how challenging the benchmark is. By framing image generation as an exam, GenExam offers a rigorous assessment of a model's ability to integrate knowledge, reasoning, and generation, providing insights into the path toward artificial general intelligence (AGI).
