On Path to Multimodal Generalist: General-Level and General-Bench
May 7, 2025
作者: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are currently experiencing rapid
growth, driven by the advanced capabilities of LLMs. Unlike earlier
specialist models, existing MLLMs are evolving towards a Multimodal Generalist
paradigm. Initially limited to understanding multiple modalities, these models
have advanced to not only comprehend but also generate across modalities. Their
capabilities have expanded from coarse-grained to fine-grained multimodal
understanding, and from supporting a limited set of modalities to arbitrary
ones. While many benchmarks exist for assessing MLLMs, a critical question
arises: can we simply assume that higher performance across tasks indicates
stronger MLLM capability, bringing us closer to human-level AI? We argue that
the answer is not as straightforward as it seems. This project introduces
General-Level, an evaluation framework that defines a five-level scale of MLLM
performance and generality, offering a methodology for comparing MLLMs and
gauging the progress of existing systems towards more robust multimodal
generalists and, ultimately, towards AGI. At the core of the framework is the
concept of Synergy, which measures whether models maintain consistent
capabilities across comprehension and generation, and across multiple
modalities. To support this evaluation, we present General-Bench, which
encompasses a broader spectrum of skills, modalities, formats, and
capabilities, comprising over 700 tasks and 325,800 instances. Evaluation
results involving over 100 existing state-of-the-art MLLMs reveal the
capability rankings of generalists, highlighting the challenges of reaching
genuine AI. We expect this project to pave the way for future research on
next-generation multimodal foundation models and to provide a robust
infrastructure for accelerating the realization of AGI.
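The abstract does not give the formal definition of Synergy, so the following is only an illustrative sketch of the underlying idea (rewarding consistent capability across comprehension and generation per modality), not the paper's actual metric; the function name and scoring scheme are hypothetical.

```python
# Illustrative only: NOT the General-Level Synergy formula. This toy score
# scales mean ability by a consistency factor, so a model that is strong at
# comprehension but weak at generation (or vice versa) is penalized.

def toy_synergy_score(scores: dict) -> float:
    """scores maps modality -> {"comprehension": x, "generation": y}, x, y in [0, 1]."""
    abilities = []
    consistencies = []
    for pair in scores.values():
        comp, gen = pair["comprehension"], pair["generation"]
        abilities.extend([comp, gen])
        hi = max(comp, gen)
        # Ratio of weaker to stronger paired ability; 1.0 means perfectly balanced.
        consistencies.append(min(comp, gen) / hi if hi > 0 else 1.0)
    mean_ability = sum(abilities) / len(abilities)
    mean_consistency = sum(consistencies) / len(consistencies)
    return mean_ability * mean_consistency

balanced = toy_synergy_score({
    "image": {"comprehension": 0.8, "generation": 0.8},
    "audio": {"comprehension": 0.7, "generation": 0.7},
})
lopsided = toy_synergy_score({
    "image": {"comprehension": 0.9, "generation": 0.3},
    "audio": {"comprehension": 0.9, "generation": 0.1},
})
# A balanced generalist outscores a lopsided one even with comparable means.
```

Under this toy scheme, the balanced model scores 0.75 while the lopsided one scores roughly 0.12, capturing the intuition that a true generalist keeps its abilities consistent rather than merely high on average.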
Project page: https://generalist.top/