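JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable method for building it. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and the question text into a single image, so that a model must read and understand the text through visual perception alone. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate them with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source large multimodal models (LMMs) struggle substantially with JMMMU-Pro, which makes it an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous tool for evaluating the Japanese capabilities of LMMs, and that Vibe Benchmark Construction offers an efficient guideline for the future development of image-based VQA benchmarks.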
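The generate, verify, and regenerate loop described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' pipeline: the names `build_item`, `generate`, `review`, and `max_attempts` are hypothetical, the image generator is passed in as a plain callable rather than a real model API, and the human check is modeled as a function that either accepts the image or returns an adjusted prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    prompt: str    # prompt that produced the accepted image
    image: bytes   # rendered question image (treated as opaque bytes here)
    attempts: int  # number of generations needed before acceptance


def build_item(
    prompt: str,
    generate: Callable[[str], bytes],           # hypothetical image-model wrapper
    review: Callable[[bytes], Optional[str]],   # hypothetical human verification step
    max_attempts: int = 3,                      # assumed retry budget, not from the paper
) -> Optional[Candidate]:
    """Generate an image, have it reviewed, and regenerate with an adjusted prompt.

    `review` returns None to accept the image, or a new prompt to retry with.
    Returns None if no image is accepted within `max_attempts` generations.
    """
    for attempt in range(1, max_attempts + 1):
        image = generate(prompt)
        adjusted = review(image)
        if adjusted is None:
            return Candidate(prompt=prompt, image=image, attempts=attempt)
        prompt = adjusted  # retry using the reviewer's adjusted prompt
    return None
```

Passing the generator and reviewer as callables keeps the sketch independent of any particular image model or annotation tool; items that come back as `None` would presumably need manual handling.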

