MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment
August 24, 2025
Authors: Omid Ghahroodi, Arshia Hemmat, Marzia Nouri, Seyed Mohammad Hadi Hosseini, Doratossadat Dastgheib, Mohammad Vali Sanian, Alireza Sahebi, Reihaneh Zohrabi, Mohammad Hossein Rohban, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah
cs.AI
Abstract
Recent advancements in large vision-language models (VLMs) have primarily
focused on English, with limited attention given to other languages. To address
this gap, we introduce MEENA (also known as PersianMMMU), the first dataset
designed to evaluate Persian VLMs across scientific, reasoning, and human-level
understanding tasks. Our dataset comprises approximately 7,500 Persian and
3,000 English questions, covering a wide range of topics such as reasoning,
mathematics, physics, diagrams, charts, and Persian art and literature. Key
features of MEENA include: (1) diverse subject coverage spanning various
educational levels, from primary to upper secondary school, (2) rich metadata,
including difficulty levels and descriptive answers, (3) original Persian data
that preserves cultural nuances, (4) a bilingual structure to assess
cross-linguistic performance, and (5) a series of diverse experiments assessing
various capabilities, including overall performance, the model's ability to
attend to images, and its tendency to generate hallucinations. We hope this
benchmark contributes to enhancing VLM capabilities beyond English.