Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
June 8, 2025
Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities in understanding common visual elements, largely due to their
large-scale datasets and advanced training strategies. However, their
effectiveness in medical applications remains limited due to the inherent
discrepancies between data and tasks in medical scenarios and those in the
general domain. Concretely, existing medical MLLMs face the following critical
limitations: (1) limited coverage of medical knowledge beyond imaging, (2)
heightened susceptibility to hallucinations due to suboptimal data curation
processes, and (3) a lack of reasoning capabilities tailored to complex medical
scenarios. To address these challenges, we first propose a comprehensive data
curation procedure that (1) efficiently acquires rich medical knowledge data
not only from medical imaging but also from extensive medical texts and
general-domain data; and (2) synthesizes accurate medical captions, visual
question answering (VQA), and reasoning samples. As a result, we build a
multimodal dataset enriched with extensive medical knowledge. Building on the
curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu
undergoes multi-stage training to embed medical expertise and enhance its
task-solving capabilities progressively. In addition, we present a preliminary
exploration of applying the reinforcement learning with verifiable rewards
(RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop
MedEvalKit, a unified evaluation framework that consolidates leading multimodal
and textual medical benchmarks for standardized, fair, and efficient model
assessment. We evaluate the performance of Lingshu on three fundamental medical
tasks: multimodal QA, text-based QA, and medical report generation. The results
show that Lingshu consistently outperforms the existing open-source multimodal
models on most tasks ...
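
The abstract does not specify how the RLVR reward is computed. As a rough
illustration only, a minimal sketch of a rule-based verifiable reward for
multiple-choice medical QA might look like the following; the function name,
the "Answer: X" output convention, and the A-E choice letters are hypothetical
assumptions, not details taken from the paper:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Rule-based reward for multiple-choice medical QA.

    Returns 1.0 when the extracted choice matches the gold answer and
    0.0 otherwise. Because the reward is computed by a fixed rule
    rather than a learned judge, it is "verifiable".
    """
    # Assume responses end with a line like "Answer: B" (a common
    # convention; Lingshu's actual output format may differ).
    matches = re.findall(r"[Aa]nswer\s*[:：]?\s*([A-E])\b", model_output)
    predicted = matches[-1].upper() if matches else None
    return 1.0 if predicted == gold_answer.upper() else 0.0

# A policy-optimization method (e.g., GRPO or PPO) would maximize this
# reward over responses sampled from the model.
print(verifiable_reward("The mass is hyperdense ... Answer: C", "C"))  # 1.0
print(verifiable_reward("Answer: A", "C"))                             # 0.0
```

The binary, rule-checked signal is what distinguishes RLVR from reward models
learned from preference data: correctness can be confirmed mechanically, which
avoids reward hacking against a learned judge.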