
Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

October 29, 2023
Authors: Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun
cs.AI

Abstract

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model, i.e., GPT-4 with Vision (GPT-4V), on the Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images, using both pathology and radiology datasets from 11 modalities (e.g., Microscopy, Dermoscopy, X-ray, CT, etc.) and fifteen objects of interest (brain, liver, lung, etc.). Our datasets encompass a comprehensive range of medical inquiries, including sixteen distinct question types. Throughout our evaluations, we devised textual prompts for GPT-4V, directing it to synergize visual and textual information. The experiments with accuracy scores conclude that the current version of GPT-4V is not recommended for real-world diagnostics due to its unreliable and suboptimal accuracy in responding to diagnostic medical questions. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its constraints within this complex arena. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.
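
The image-plus-question prompting setup described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of sending a medical image paired with a textual question to a GPT-4V-style endpoint and scoring replies by exact match; it assumes the OpenAI Python SDK (v1.x), and the model name, prompt wording, file names, and ground-truth answers are placeholders, not the paper's actual evaluation code.

```python
# Minimal sketch: pair a medical image with a question for a GPT-4V-style model,
# then compute a toy exact-match accuracy. Assumes the OpenAI Python SDK (v1.x);
# model name, prompt text, file paths, and answers are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4v(image_path: str, question: str) -> str:
    """Send one image + question pair and return the model's answer text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # hypothetical choice of GPT-4V endpoint
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Answer the medical question using the image. Question: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()

# Toy accuracy over (image, question, ground-truth) triples, exact-match style.
samples = [("ct_lung.png", "Is a nodule present in the left lung?", "yes")]
correct = sum(ask_gpt4v(img, q).lower().startswith(ans) for img, q, ans in samples)
print(f"Accuracy: {correct / len(samples):.2f}")
```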
