Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

October 29, 2023
Authors: Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun
cs.AI

Abstract

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on the Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images, using both pathology and radiology datasets that span 11 modalities (e.g., microscopy, dermoscopy, X-ray, CT) and 15 objects of interest (e.g., brain, liver, lung). Our datasets cover a comprehensive range of medical inquiries, including 16 distinct question types. Throughout our evaluations, we devised textual prompts that direct GPT-4V to combine visual and textual information. The accuracy-based experiments show that the current version of GPT-4V is not recommended for real-world diagnostics, because its answers to diagnostic medical questions are unreliable and its accuracy is suboptimal. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its limitations in this complex domain. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.
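As an illustration of what an accuracy score broken down by question type could look like, the following is a minimal sketch and not the authors' evaluation code. It assumes exact-match scoring after light text normalization; the record fields (`question_type`, `prediction`, `reference`) and the sample records are hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def accuracy_by_question_type(records):
    """Return {question_type: fraction of exact-match answers}.

    Exact match is an assumption made for this sketch; the abstract
    does not say how answers were judged correct.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        qtype = record["question_type"]
        total[qtype] += 1
        if normalize(record["prediction"]) == normalize(record["reference"]):
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Hypothetical records, invented purely to exercise the function.
records = [
    {"question_type": "modality", "prediction": "CT.", "reference": "ct"},
    {"question_type": "modality", "prediction": "MRI", "reference": "X-ray"},
    {"question_type": "abnormality", "prediction": "Yes", "reference": "yes"},
]
scores = accuracy_by_question_type(records)
```

Reporting accuracy per question type rather than as a single number makes it easier to see whether errors concentrate in diagnostic questions, which is the category the abstract singles out as unreliable.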
