MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
October 14, 2024
Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
cs.AI
Abstract
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation
to over 500 real-world tasks, to address the highly heterogeneous daily use
cases of end users. Our objective is to optimize for a set of high-quality data
samples that cover a highly diverse and rich set of multimodal tasks, while
enabling cost-effective and accurate model evaluation. In particular, we
collected 505 realistic tasks encompassing over 8,000 samples from 16 expert
annotators to extensively cover the multimodal task space. Instead of unifying
these problems into standard multiple-choice questions (as in MMMU, MMBench, and
MMT-Bench), we embrace a wide range of output formats such as numbers, phrases,
code, LaTeX, coordinates, JSON, and free-form text. To accommodate these formats,
we developed over 40 metrics to evaluate these tasks. Unlike existing
benchmarks, MEGA-Bench offers a fine-grained capability report across multiple
dimensions (e.g., application, input type, output format, skill), allowing
users to interact with and visualize model capabilities in depth. We evaluate a
wide variety of frontier vision-language models on MEGA-Bench to understand
their capabilities across these dimensions.
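The abstract notes that MEGA-Bench develops over 40 metrics to accommodate its diverse output formats. As a purely illustrative sketch (the function, format names, and tolerances below are hypothetical, not MEGA-Bench's actual scoring code), format-aware evaluation can be thought of as dispatching each prediction to a metric keyed on the task's declared output format:

```python
import json
import math

# Illustrative sketch only: a format-aware scorer in the spirit of
# MEGA-Bench's per-format metrics. Names and tolerances are hypothetical.

def score(prediction: str, reference: str, output_format: str) -> float:
    """Return a score in [0, 1] using a metric chosen by output format."""
    if output_format == "number":
        # Numeric answers: compare within a relative tolerance.
        try:
            return float(math.isclose(float(prediction), float(reference),
                                      rel_tol=1e-3))
        except ValueError:
            return 0.0
    if output_format == "json":
        # Structured answers: parse both sides and compare the objects,
        # so key order and whitespace differences do not matter.
        try:
            return float(json.loads(prediction) == json.loads(reference))
        except json.JSONDecodeError:
            return 0.0
    if output_format == "phrase":
        # Short text answers: normalized exact match.
        return float(prediction.strip().lower() == reference.strip().lower())
    # Free-form answers would need a softer metric (e.g., an LLM judge or
    # ROUGE); this sketch falls back to exact match.
    return float(prediction.strip() == reference.strip())


if __name__ == "__main__":
    print(score("3.1416", "3.14159", "number"))                       # 1.0
    print(score('{"b": 2, "a": 1}', '{"a": 1, "b": 2}', "json"))      # 1.0
    print(score("Golden Gate Bridge", "golden gate bridge", "phrase"))  # 1.0
```

Routing on a declared output format is what lets a single suite mix numeric, structured, and free-form tasks without forcing everything into multiple-choice questions.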