
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

July 30, 2023
Authors: Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
cs.AI

Abstract

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6x larger than existing benchmarks), spanning 12 evaluation dimensions that cover comprehension of both image and video modalities. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through these evaluation results, we aim for SEED-Bench to provide insights that motivate future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capabilities.
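
Because each question ships with a human-annotated groundtruth option, scoring reduces to checking whether the model selects the correct choice and aggregating accuracy per evaluation dimension, with no human or GPT judge in the loop. The sketch below illustrates this kind of objective scoring; the data fields (dimension, question, choices, answer) and the option-picking callable are illustrative assumptions, not SEED-Bench's actual data format or model interface.

# Minimal sketch of objective multiple-choice scoring with human-annotated
# groundtruth, in the spirit of SEED-Bench's evaluation protocol.
# The item layout and the `choose` callable are assumptions for illustration.
from collections import defaultdict
from typing import Callable, Dict, List


def evaluate(items: List[Dict], choose: Callable[[str, List[str]], int]) -> Dict[str, float]:
    """Return per-dimension accuracy given a model's option-picking function."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in items:
        dim = q["dimension"]                        # e.g. "Scene Understanding" (assumed key)
        pred = choose(q["question"], q["choices"])  # index of the option the model selects
        total[dim] += 1
        correct[dim] += int(pred == q["answer"])    # groundtruth index from human annotation
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Tiny toy item; real SEED-Bench spans 12 dimensions over images and video.
    toy = [
        {"dimension": "Scene Understanding",
         "question": "Where is the photo taken?",
         "choices": ["kitchen", "beach", "office", "forest"],
         "answer": 1},
    ]
    always_first = lambda question, choices: 0      # stand-in for a real MLLM's choice
    print(evaluate(toy, always_first))               # {'Scene Understanding': 0.0}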