SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
July 30, 2023
Authors: Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
cs.AI
Abstract
Based on powerful Large Language Models (LLMs), recent generative Multimodal
Large Language Models (MLLMs) have gained prominence as a pivotal research
area, exhibiting remarkable capability for both comprehension and generation.
In this work, we address the evaluation of generative comprehension in MLLMs as
a preliminary step towards a comprehensive assessment of generative models, by
introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice
questions with accurate human annotations (6x larger than existing
benchmarks), spanning 12 evaluation dimensions, including the comprehension
of both the image and video modality. We develop an advanced pipeline for
generating multiple-choice questions that target specific evaluation
dimensions, integrating both automatic filtering and manual verification
processes. Multiple-choice questions with ground-truth options derived from
human annotation enable an objective and efficient assessment of model
performance, eliminating the need for human or GPT intervention during
evaluation. We further evaluate the performance of 18 models across all 12
dimensions, covering both spatial and temporal understanding. By revealing
the limitations of existing MLLMs through evaluation results, we aim for
SEED-Bench to provide insights for motivating future research. We will launch
and continually maintain a leaderboard to provide a platform for the community
to assess and investigate model capability.
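
Because every question carries a single ground-truth option, accuracy can be scored automatically, with no human or GPT judge in the loop. The sketch below is not the paper's released evaluation code; it illustrates one common way to obtain a model's choice, namely ranking the candidate options by their log-likelihood under a causal language model and picking the highest-scoring one. The model name and the `question`/`options`/`answer_idx` fields are hypothetical placeholders.

```python
# Hypothetical sketch: score each multiple-choice option by its log-likelihood
# under a causal LM and count a prediction as correct if the top-ranked option
# matches the ground-truth index. NOT the official SEED-Bench evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder text-only model; SEED-Bench targets multimodal LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loglikelihood(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Next-token log-probs; position i predicts token i+1, hence the shift.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_start = prompt_ids.shape[1] - 1
    targets = input_ids[0, prompt_ids.shape[1]:]
    return log_probs[option_start:, :].gather(1, targets.unsqueeze(1)).sum().item()

def evaluate(samples) -> float:
    """samples: list of dicts with hypothetical keys 'question', 'options', 'answer_idx'."""
    correct = 0
    for s in samples:
        scores = [option_loglikelihood(s["question"], opt) for opt in s["options"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == s["answer_idx"])
    return correct / len(samples)
```

Ranking options by likelihood avoids parsing free-form generations, which is one way the fixed ground-truth options make evaluation objective and reproducible.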