GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
August 6, 2024
Authors: Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
cs.AI
Abstract
Large Vision-Language Models (LVLMs) are capable of handling diverse data
types such as imaging, text, and physiological signals, and can be applied in
various fields. In the medical field, LVLMs have a high potential to offer
substantial assistance for diagnosis and treatment. Before that, it is crucial
to develop benchmarks to evaluate LVLMs' effectiveness in various medical
applications. Current benchmarks are often built upon specific academic
literature, focus mainly on a single domain, and lack varying perceptual
granularities. Thus, they face specific challenges, including limited clinical
relevance, incomplete evaluations, and insufficient guidance for interactive
LVLMs. To address these limitations, we developed GMAI-MMBench, the most
comprehensive general medical AI benchmark to date, with a well-categorized
data structure and multi-perceptual granularity. It is constructed from 285 datasets
across 39 medical image modalities, 18 clinical-related tasks, 18 departments,
and 4 perceptual granularities in a Visual Question Answering (VQA) format.
Additionally, we implemented a lexical tree structure that allows users to
customize evaluation tasks, accommodating various assessment needs and
substantially supporting medical AI research and applications. We evaluated 50
LVLMs, and the results show that even the advanced GPT-4o achieves an
accuracy of only 52%, indicating significant room for improvement. Moreover, we
identified five key insufficiencies in current cutting-edge LVLMs that need to
be addressed to advance the development of better medical applications. We
believe that GMAI-MMBench will stimulate the community to build the next
generation of LVLMs toward GMAI.
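To make the VQA format and the lexical-tree-based task customization described above more concrete, here is a minimal sketch of how one might filter benchmark items by category labels and score a model's predictions. All field names (`modality`, `department`, `options`, `answer`), the file name, and the helper functions are illustrative assumptions, not GMAI-MMBench's actual schema or API.

```python
# Illustrative sketch only: the JSON schema and field names below are assumptions,
# not the actual GMAI-MMBench data format or evaluation API.
import json
from typing import Callable, Dict, List


def load_items(path: str) -> List[Dict]:
    """Load VQA items; each item is assumed to carry an image path, a question,
    answer options, the ground-truth option letter, and category labels."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def select(items: List[Dict], **criteria: str) -> List[Dict]:
    """Keep items whose category labels (e.g. modality, department, task) match
    every requested criterion -- a flat stand-in for lexical-tree filtering."""
    return [item for item in items if all(item.get(k) == v for k, v in criteria.items())]


def accuracy(items: List[Dict], predict: Callable[[Dict], str]) -> float:
    """Fraction of items where the predicted option letter equals the ground truth."""
    if not items:
        return 0.0
    return sum(predict(item) == item["answer"] for item in items) / len(items)


if __name__ == "__main__":
    items = load_items("gmai_mmbench_val.json")   # hypothetical file name
    subset = select(items, modality="CT", department="Radiology")
    always_a = lambda item: "A"                   # placeholder "model" that always answers A
    print(f"{len(subset)} items, accuracy = {accuracy(subset, always_a):.2%}")
```

The benchmark's actual lexical tree presumably supports hierarchical selection (e.g. picking a department node and evaluating everything beneath it), rather than the flat key-value matching shown here.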
Project Page: https://uni-medical.github.io/GMAI-MMBench.github.io/