MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

Abstract

MM-Vet, whose open-ended vision-language questions are designed to evaluate integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences that are prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability called "image-text sequence understanding" to evaluate models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
