CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Abstract

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating their performance is becoming increasingly important. The gap is even larger in non-English contexts such as Chinese, where few benchmarks assess the advanced knowledge and reasoning abilities of LMMs. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks that demand college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by MMMU and strictly follows its annotation and analysis pattern. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LMMs and one proprietary model, GPT-4V(ision). Even GPT-4V achieves only 42% accuracy, indicating substantial room for improvement. CMMMU is intended to help the community build next-generation LMMs on the path toward expert artificial intelligence and to promote the democratization of LMMs by providing diverse language contexts.
