Towards Generalist Biomedical AI

Abstract

Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery. To enable the development of these models, we first curate MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. We also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning. To further probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales. In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.
