MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Abstract

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on a blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges for evaluation benchmark development. Problems include: (1) how to systematically structure and evaluate complicated multimodal tasks; (2) how to design evaluation metrics that work well across question and answer types; and (3) how to provide insights into models beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model that can integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from combinations of these capabilities. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.
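
As a rough illustration of how a unified per-sample score can be broken down by capability and by capability integration, the following minimal sketch aggregates scores over tagged samples. It is not the benchmark's actual code: the record layout, the short capability tags, the score range of 0 to 1, and the `aggregate` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records: each sample lists the capabilities it requires and
# carries a score in [0, 1] assumed to come from some evaluator.
# The tag names below are illustrative placeholders only.
samples = [
    {"capabilities": {"rec", "ocr"}, "score": 1.0},
    {"capabilities": {"ocr", "math"}, "score": 0.5},
    {"capabilities": {"rec", "know", "gen"}, "score": 0.0},
    {"capabilities": {"ocr", "math"}, "score": 1.0},
]


def mean_percent(scores):
    """Average a list of [0, 1] scores and express the result in percent."""
    return 100.0 * sum(scores) / len(scores)


def aggregate(samples):
    """Return (overall, per-capability, per-integration) mean scores in percent."""
    per_capability = defaultdict(list)
    per_integration = defaultdict(list)
    for sample in samples:
        # A sample counts toward every capability it requires...
        for capability in sample["capabilities"]:
            per_capability[capability].append(sample["score"])
        # ...and toward exactly one integration (its full capability set).
        per_integration[frozenset(sample["capabilities"])].append(sample["score"])
    overall = mean_percent([sample["score"] for sample in samples])
    by_capability = {k: mean_percent(v) for k, v in per_capability.items()}
    by_integration = {k: mean_percent(v) for k, v in per_integration.items()}
    return overall, by_capability, by_integration


overall, by_capability, by_integration = aggregate(samples)
```

Grouping by the full capability set (a `frozenset`) keeps each integration separate from the individual capabilities it contains, so a model that handles two capabilities in isolation but struggles with their combination would show up in the per-integration breakdown.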
