

On Path to Multimodal Generalist: General-Level and General-Bench

May 7, 2025
作者: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: can we simply assume that higher performance across tasks indicates stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines a five-level scale of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. Evaluation results involving over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/
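To make the Synergy idea concrete, here is a minimal sketch of how a synergy-style score *could* be computed. The abstract does not give the actual formula; the harmonic-mean aggregation, the `synergy_score` function, and the toy modality scores below are all illustrative assumptions, not the paper's method.

```python
# Hypothetical illustration of a "Synergy"-style metric (NOT the paper's
# actual formula): take the harmonic mean of a model's comprehension and
# generation scores per modality, then average across modalities, so a
# model is rewarded only when both abilities are consistently strong.

def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def synergy_score(scores: dict) -> float:
    """scores maps modality -> (comprehension, generation), each in [0, 1]."""
    if not scores:
        return 0.0
    per_modality = [harmonic_mean(c, g) for c, g in scores.values()]
    return sum(per_modality) / len(per_modality)

# A model that is strong at comprehension but weak at generation scores
# lower than a balanced model with the same average raw performance:
balanced = synergy_score({"image": (0.8, 0.8), "audio": (0.8, 0.8)})
skewed = synergy_score({"image": (1.0, 0.2), "audio": (1.0, 0.2)})
```

Under this toy scoring, `balanced` evaluates to 0.8 while `skewed` falls to about 0.33, capturing the abstract's point that consistency across comprehension and generation matters more than peak single-direction performance.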

