

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

June 8, 2025
Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to inherent discrepancies between the data and tasks of medical scenarios and those of the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored to complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data, and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to progressively embed medical expertise and enhance its task-solving capabilities. In addition, we preliminarily explore the potential of applying the reinforcement learning with verifiable rewards (RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms existing open-source multimodal models on most tasks ...
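The abstract mentions reinforcement learning with verifiable rewards (RLVR) but does not specify how the reward is computed. The sketch below shows a generic, commonly used form of a verifiable reward for multiple-choice medical QA: extract the model's final answer letter and compare it against the gold label. The function names (`extract_choice`, `verifiable_reward`) and the answer-format regex are illustrative assumptions, not Lingshu's actual implementation.

```python
# Minimal sketch of a verifiable reward for multiple-choice medical QA,
# assuming RLVR-style training on QA pairs with known gold choices.
# Illustrative only; the paper does not disclose Lingshu's reward code.
import re


def extract_choice(response: str) -> str | None:
    """Pull the final answer letter (A-E) from a model response.

    Looks for patterns like "Answer: C", then falls back to the last
    standalone capital letter A-E appearing anywhere in the response.
    """
    match = re.search(r"[Aa]nswer\s*[:\-]?\s*\(?([A-E])\)?", response)
    if match:
        return match.group(1).upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None


def verifiable_reward(response: str, gold_choice: str) -> float:
    """Return 1.0 if the extracted choice matches the gold label, else 0.0."""
    predicted = extract_choice(response)
    return 1.0 if predicted == gold_choice.upper() else 0.0


if __name__ == "__main__":
    sample = "The CT shows a hyperdense lesion, so the answer is (B)."
    print(verifiable_reward(sample, "B"))  # 1.0
```

Because the reward is a deterministic check against ground truth rather than a learned reward model, it gives a clean, non-hallucinating training signal, which is the usual motivation for applying RLVR to reasoning tasks.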