
Capabilities of Gemini Models in Medicine

April 29, 2024
Authors: Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, Juanma Zambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G. T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, Le Hou, Tomer Golany, Luyang Liu, Jean-baptiste Alayrac, Neil Houlsby, Nenad Tomasev, Jan Freyberg, Charles Lau, Jonas Kemp, Jeremy Lai, Shekoofeh Azizi, Kimberly Kanada, SiWai Man, Kavita Kulkarni, Ruoxi Sun, Siamak Shakeri, Luheng He, Ben Caine, Albert Webson, Natasha Latysheva, Melvin Johnson, Philip Mansfield, Jian Lu, Ehud Rivlin, Jesper Anderson, Bradley Green, Renee Wong, Jonathan Krause, Jonathon Shlens, Ewa Dominowska, S. M. Ali Eslami, Claire Cui, Oriol Vinyals, Koray Kavukcuoglu, James Manyika, Jeff Dean, Demis Hassabis, Yossi Matias, Dale Webster, Joelle Barral, Greg Corrado, Christopher Semturs, S. Sara Mahdavi, Juraj Gottweis, Alan Karthikesalingam, Vivek Natarajan
cs.AI

Abstract

Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge, and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine, with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them and surpassing the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks, including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and on medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research, and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
