MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

October 14, 2024
Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
cs.AI

Abstract

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
