SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Abstract

Built on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), spanning 12 evaluation dimensions that include the comprehension of both the image and video modalities. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with ground-truth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through the evaluation results, we aim for SEED-Bench to provide insights that motivate future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
