Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Abstract

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on the Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images, using both pathology and radiology datasets that span 11 modalities (e.g., microscopy, dermoscopy, X-ray, CT) and fifteen objects of interest (brain, liver, lung, etc.). Our datasets cover a comprehensive range of medical inquiries, including sixteen distinct question types. Throughout our evaluations, we devised textual prompts that direct GPT-4V to combine visual and textual information. Based on accuracy scores, our experiments indicate that the current version of GPT-4V is not recommended for real-world diagnostics, because its accuracy in responding to diagnostic medical questions is unreliable and suboptimal. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its constraints within this complex arena. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.
