SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Abstract

Built on powerful Large Language Models (LLMs), generative Multimodal Large Language Models (MLLMs) have recently become a pivotal research area, showing remarkable capability in both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, and introduce a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), spanning 12 evaluation dimensions that cover comprehension of both the image and video modalities. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification. Multiple-choice questions with ground-truth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through these evaluation results, we aim for SEED-Bench to provide insights that motivate future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
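To illustrate why multiple-choice questions with ground-truth options can be scored without a human or GPT judge, here is a minimal sketch of per-dimension accuracy. It is not the authors' evaluation code. The record keys (`dimension`, `answer`, `prediction`), the dimension labels, and the toy data are assumptions made up for this example, and the abstract does not say how a model's chosen option is obtained, so the sketch takes predictions as given.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      'dimension'  - name of the evaluation dimension
      'answer'     - ground-truth option label, e.g. "A"
      'prediction' - option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        # Scoring is an exact match on the option label, so no judge is needed.
        if record["prediction"] == record["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy records invented for illustration; the dimension names are placeholders.
records = [
    {"dimension": "spatial", "answer": "A", "prediction": "A"},
    {"dimension": "spatial", "answer": "C", "prediction": "B"},
    {"dimension": "temporal", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_dimension(records)
print(scores)  # {'spatial': 0.5, 'temporal': 1.0}
```

Because scoring reduces to comparing option labels, the same function applies unchanged to any number of dimensions or models.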
