CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Abstract

As the capabilities of large multimodal models (LMMs) continue to advance, the need to evaluate their performance keeps growing. The gap is even larger when it comes to evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks that demand college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by MMMU and strictly follows its annotation and analysis pattern. It includes 12k multimodal questions manually collected from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, the same disciplines as its companion benchmark, MMMU. The questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary model, GPT-4V(ision). Even GPT-4V achieves an accuracy of only 42%, indicating substantial room for improvement. CMMMU is intended to help the community build next-generation LMMs on the path toward expert artificial intelligence, and to promote the democratization of LMMs by providing diverse language contexts.
