
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

December 16, 2025
Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
cs.AI

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, together with Vibe Benchmark Construction, a scalable method for building it. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, yielding a benchmark that requires integrated visual-textual understanding through visual perception alone. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generation model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, where necessary, regenerate them with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation and its ability to render clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of backgrounds and layout designs. Experimental results show that all open-source LMMs struggle substantially on JMMMU-Pro, underscoring its value as a benchmark for guiding future efforts in the open-source community. We believe JMMMU-Pro provides a more rigorous tool for assessing the Japanese-language capabilities of LMMs, and that Vibe Benchmark Construction offers an efficient guideline for the future development of image-based VQA benchmarks.
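
To make the composition step concrete, here is a minimal, purely illustrative Pillow sketch of the MMMU-Pro-style idea of rendering the question text and the question figure into a single image. The function name, layout, and font handling are assumptions for illustration only; the paper's pipeline instead generates such composed images with Nano Banana Pro rather than by deterministic rendering.

```python
"""Illustrative sketch: compose question text and a question figure into one
image, so a model must read the text through visual perception. Assumed
helper, not the authors' pipeline."""

from PIL import Image, ImageDraw, ImageFont


def compose_question_image(figure_path: str, question_text: str,
                           out_path: str, margin: int = 24) -> None:
    figure = Image.open(figure_path).convert("RGB")
    # The default bitmap font cannot render Japanese; in practice a
    # CJK-capable TrueType font would be loaded via ImageFont.truetype().
    font = ImageFont.load_default()

    # Reserve a text band above the figure, sized to the question lines.
    lines = question_text.splitlines() or [question_text]
    line_h = 16
    band_h = margin * 2 + line_h * len(lines)

    canvas = Image.new("RGB", (figure.width, band_h + figure.height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_h), line, fill="black", font=font)
    canvas.paste(figure, (0, band_h))
    canvas.save(out_path)
```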
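The Vibe Benchmark Construction methodology can likewise be sketched as a generate-verify-regenerate cycle. Everything below is a hypothetical skeleton: `generate_visual_question` stands in for a call to an image generation model such as Nano Banana Pro (the paper does not specify an API), and the human verification and prompt-adjustment steps are reduced to console prompts.

```python
"""Hypothetical skeleton of the generate-verify-regenerate loop described in
the abstract; an illustrative sketch, not the authors' implementation."""

from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Candidate:
    prompt: str       # generation prompt (question text, layout, background)
    image_path: Path  # rendered visual question


def generate_visual_question(prompt: str, out_dir: Path) -> Candidate:
    """Assumed stand-in for an image-generation call (e.g., Nano Banana Pro)
    that renders the question text and figure into one image."""
    raise NotImplementedError("replace with a real image-generation API call")


def human_approves(candidate: Candidate) -> bool:
    """Stand-in for manual verification: a reviewer checks that the rendered
    Japanese text is clean and the question content is preserved."""
    return input(f"Accept {candidate.image_path}? [y/n] ").strip().lower() == "y"


def adjust_prompt(prompt: str) -> str:
    """Stand-in for the human revising the prompt (e.g., fixing the layout or
    requesting a different background) before regeneration."""
    return input(f"Revise prompt ({prompt!r}): ") or prompt


def build_benchmark_item(initial_prompt: str, out_dir: Path,
                         max_rounds: int = 3) -> Optional[Candidate]:
    """Keep a candidate only once a human verifies it; otherwise regenerate
    with an adjusted prompt, discarding items that never pass."""
    prompt = initial_prompt
    for _ in range(max_rounds):
        candidate = generate_visual_question(prompt, out_dir)
        if human_approves(candidate):
            return candidate
        prompt = adjust_prompt(prompt)
    return None
```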