On Path to Multimodal Generalist: General-Level and General-Bench
May 7, 2025
作者: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are currently experiencing rapid
growth, driven by the advanced capabilities of LLMs. Unlike earlier
specialist models, existing MLLMs are evolving towards a Multimodal Generalist
paradigm. Initially limited to understanding multiple modalities, these models
have advanced to not only comprehend but also generate across modalities. Their
capabilities have expanded from coarse-grained to fine-grained multimodal
understanding, and from supporting a limited set of modalities to arbitrary
ones. While many benchmarks exist for assessing MLLMs, a critical question
arises: can we simply assume that higher performance across tasks indicates
stronger MLLM capability, bringing us closer to human-level AI? We argue that
the answer is not as straightforward as it seems. This project introduces
General-Level, an evaluation framework that defines a five-level scale of MLLM
performance and generality, offering a methodology for comparing MLLMs and
gauging the progress of existing systems towards more robust multimodal
generalists and, ultimately, towards AGI. At the core of the framework is the
concept of Synergy, which measures whether models maintain consistent
capabilities across comprehension and generation, and across multiple
modalities. To support this evaluation, we present General-Bench, which
encompasses a broader spectrum of skills, modalities, formats, and
capabilities, comprising over 700 tasks and 325,800 instances. Evaluation
results involving over 100 existing state-of-the-art MLLMs reveal the
capability rankings of generalists, highlighting the challenges of reaching
genuine AI. We expect this project to pave the way for future research on
next-generation multimodal foundation models and to provide a robust
infrastructure for accelerating the realization of AGI.
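The abstract does not give the formal definition of Synergy, so the following is only an illustrative sketch of the underlying idea (rewarding consistent capability across comprehension and generation per modality), not the paper's actual metric; the function name and scoring scheme are hypothetical.

```python
# Illustrative only: NOT the General-Level Synergy formula. This toy score
# scales mean ability by a consistency factor, so a model that is strong at
# comprehension but weak at generation (or vice versa) is penalized.

def toy_synergy_score(scores: dict) -> float:
    """scores maps modality -> {"comprehension": x, "generation": y}, x, y in [0, 1]."""
    abilities = []
    consistencies = []
    for pair in scores.values():
        comp, gen = pair["comprehension"], pair["generation"]
        abilities.extend([comp, gen])
        hi = max(comp, gen)
        # Ratio of weaker to stronger paired ability; 1.0 means perfectly balanced.
        consistencies.append(min(comp, gen) / hi if hi > 0 else 1.0)
    mean_ability = sum(abilities) / len(abilities)
    mean_consistency = sum(consistencies) / len(consistencies)
    return mean_ability * mean_consistency

balanced = toy_synergy_score({
    "image": {"comprehension": 0.8, "generation": 0.8},
    "audio": {"comprehension": 0.7, "generation": 0.7},
})
lopsided = toy_synergy_score({
    "image": {"comprehension": 0.9, "generation": 0.3},
    "audio": {"comprehension": 0.9, "generation": 0.1},
})
# A balanced generalist outscores a lopsided one even with comparable means.
```

Under this toy scheme, the balanced model scores 0.75 while the lopsided one scores roughly 0.12, capturing the intuition that a true generalist keeps its abilities consistent rather than merely high on average.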
Project page: https://generalist.top/