JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

Abstract

Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, in which culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when they are evaluated in Japanese, which is purely attributable to language variation. Using the CS subset, we reveal their inadequate understanding of Japanese culture. Further, by combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline for creating high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.
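
The two comparisons described in the abstract (Japanese versus English on the CA subset, and CA versus CS within Japanese) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record layout, the function name `subset_accuracy`, and the toy outcomes are hypothetical and are not the benchmark's data, results, or evaluation code.

```python
from collections import defaultdict


def subset_accuracy(records):
    """Return {subset: accuracy} for an iterable of (subset, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subset, is_correct in records:
        totals[subset] += 1
        correct[subset] += int(is_correct)
    return {subset: correct[subset] / totals[subset] for subset in totals}


# Toy placeholder outcomes for one hypothetical model (NOT results from the paper).
toy_japanese = [
    ("CA", True), ("CA", False), ("CA", True), ("CA", True),
    ("CS", True), ("CS", False), ("CS", False), ("CS", False),
]
# The same CA questions answered in English (the MMMU counterpart), also toy data.
toy_english = [("CA", True), ("CA", True), ("CA", True), ("CA", True)]

ja = subset_accuracy(toy_japanese)
en = subset_accuracy(toy_english)

# Drop on identical questions, so it isolates the effect of the language change.
language_drop = en["CA"] - ja["CA"]
# Gap between translated and culture-specific questions within Japanese.
culture_gap = ja["CA"] - ja["CS"]

print(f"CA (en): {en['CA']:.2f}  CA (ja): {ja['CA']:.2f}  CS (ja): {ja['CS']:.2f}")
print(f"language drop: {language_drop:.2f}  culture gap: {culture_gap:.2f}")
```

Because the CA questions are one-to-one translations, `language_drop` compares like with like, while a large `culture_gap` alongside a small `language_drop` would correspond to the "shallow understanding" pattern the abstract describes.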
