A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
September 23, 2024
Authors: Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou
cs.AI
Abstract
Large language models (LLMs) have exhibited remarkable capabilities across
various domains and tasks, pushing the boundaries of our knowledge in learning
and cognition. The latest model, OpenAI's o1, stands out as the first LLM with
an internalized chain-of-thought technique using reinforcement learning
strategies. While it has demonstrated surprisingly strong capabilities on
various general language tasks, its performance in specialized fields such as
medicine remains unknown. To this end, this report provides a comprehensive
exploration of o1 in different medical scenarios, examining 3 key aspects:
understanding, reasoning, and multilinguality. Specifically, our evaluation
encompasses 6 tasks using data from 37 medical datasets, including two newly
constructed and more challenging question-answering (QA) tasks based on
professional medical quizzes from the New England Journal of Medicine (NEJM)
and The Lancet. These datasets offer greater clinical relevance compared to
standard medical QA benchmarks such as MedQA, translating more effectively into
real-world clinical utility. Our analysis of o1 suggests that the enhanced
reasoning ability of LLMs may (significantly) benefit their capability to
understand various medical instructions and reason through complex clinical
scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average
of 6.2% across 19 datasets and 6.6% across two newly created complex QA scenarios.
At the same time, however, we identify several weaknesses in both the model's
capability and the existing evaluation protocols, including hallucination,
inconsistent multilingual ability, and discrepant evaluation metrics. We release
our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for
future research.
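To make the headline comparison concrete, the sketch below shows one way to compute a macro-averaged accuracy gain: the mean of per-dataset accuracy differences between two models. This is a minimal illustration of the averaging scheme implied by the abstract, not the paper's released evaluation code; the dataset names and scores are hypothetical placeholders.

```python
# Minimal sketch: macro-averaged accuracy gain of one model over another
# across a set of evaluation datasets. All names and scores below are
# hypothetical placeholders, not the paper's reported results.

def macro_average_gain(acc_a: dict[str, float], acc_b: dict[str, float]) -> float:
    """Mean of per-dataset accuracy differences (acc_a - acc_b),
    computed only over datasets present in both score tables."""
    shared = sorted(set(acc_a) & set(acc_b))
    if not shared:
        raise ValueError("no overlapping datasets to compare")
    return sum(acc_a[d] - acc_b[d] for d in shared) / len(shared)

if __name__ == "__main__":
    o1_scores = {"dataset_1": 0.88, "dataset_2": 0.75}    # placeholder accuracies
    gpt4_scores = {"dataset_1": 0.81, "dataset_2": 0.70}  # placeholder accuracies
    print(f"average accuracy gain: {macro_average_gain(o1_scores, gpt4_scores):+.1%}")
```

Averaging per-dataset differences (rather than pooling all questions) weights each benchmark equally regardless of its size, which matters when the 37 datasets vary widely in question count.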