GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
August 6, 2024
Authors: Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
cs.AI
Abstract
Large Vision-Language Models (LVLMs) are capable of handling diverse data
types such as imaging, text, and physiological signals, and can be applied in
various fields. In the medical field, LVLMs have a high potential to offer
substantial assistance for diagnosis and treatment. Before that, it is crucial
to develop benchmarks to evaluate LVLMs' effectiveness in various medical
applications. Current benchmarks are often built upon specific academic
literature, mainly focus on a single domain, and lack varying perceptual
granularities. As a result, they face specific challenges, including limited clinical
relevance, incomplete evaluations, and insufficient guidance for interactive
LVLMs. To address these limitations, we developed the GMAI-MMBench, the most
comprehensive general medical AI benchmark with well-categorized data structure
and multi-perceptual granularity to date. It is constructed from 285 datasets
across 39 medical image modalities, 18 clinical-related tasks, 18 departments,
and 4 perceptual granularities in a Visual Question Answering (VQA) format.
Additionally, we implemented a lexical tree structure that allows users to
customize evaluation tasks, accommodating various assessment needs and
substantially supporting medical AI research and applications. We evaluated 50
LVLMs, and the results show that even the advanced GPT-4o only achieves an
accuracy of 52%, indicating significant room for improvement. Moreover, we
identified five key insufficiencies in current cutting-edge LVLMs that need to
be addressed to advance the development of better medical applications. We
believe that GMAI-MMBench will stimulate the community to build the next
generation of LVLMs toward GMAI.
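The lexical tree mentioned above organizes benchmark items along categorical axes (e.g. modality, clinical task, perceptual granularity) so that users can carve out custom evaluation subsets. A minimal sketch of that idea, assuming a simple nested-dictionary taxonomy; all names, categories, and item IDs here are illustrative, not the benchmark's actual structure or API:

```python
# Hypothetical lexical-tree-style task selector. The taxonomy layout
# (modality -> task -> granularity -> item IDs) and all values are
# made up for illustration; they do not mirror GMAI-MMBench internals.

def build_tree():
    # Each root-to-leaf path names one evaluation slice of VQA items.
    return {
        "CT": {
            "organ segmentation": {
                "image level": ["q1"],
                "box level": ["q2"],
            },
        },
        "X-ray": {
            "disease diagnosis": {
                "image level": ["q3", "q4"],
            },
        },
    }

def select(tree, modality=None, task=None, granularity=None):
    """Collect item IDs matching the optional filter at each tree level.

    A filter of None means "accept every branch at that level", so
    callers can customize an evaluation as broadly or narrowly as needed.
    """
    items = []
    for m, tasks in tree.items():
        if modality is not None and m != modality:
            continue
        for t, grans in tasks.items():
            if task is not None and t != task:
                continue
            for g, ids in grans.items():
                if granularity is not None and g != granularity:
                    continue
                items.extend(ids)
    return items
```

For example, `select(tree, modality="X-ray")` would return only the X-ray items, while calling `select(tree)` with no filters traverses the full tree — the same mechanism lets one benchmark serve many assessment needs.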
Project Page: https://uni-medical.github.io/GMAI-MMBench.github.io/