Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
June 8, 2025
Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities in understanding common visual elements, largely due to their
large-scale datasets and advanced training strategies. However, their
effectiveness in medical applications remains limited due to the inherent
discrepancies between data and tasks in medical scenarios and those in the
general domain. Concretely, existing medical MLLMs face the following critical
limitations: (1) limited coverage of medical knowledge beyond imaging, (2)
heightened susceptibility to hallucinations due to suboptimal data curation
processes, and (3) a lack of reasoning capabilities tailored to complex medical
scenarios. To address these challenges, we first propose a comprehensive data
curation procedure that (1) efficiently acquires rich medical knowledge data
not only from medical imaging but also from extensive medical texts and
general-domain data; and (2) synthesizes accurate medical captions, visual
question answering (VQA), and reasoning samples. As a result, we build a
multimodal dataset enriched with extensive medical knowledge. Building on the
curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu
undergoes multi-stage training to embed medical expertise and enhance its
task-solving capabilities progressively. In addition, we present a preliminary
exploration of applying the reinforcement learning with verifiable rewards
(RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop
MedEvalKit, a unified evaluation framework that consolidates leading multimodal
and textual medical benchmarks for standardized, fair, and efficient model
assessment. We evaluate the performance of Lingshu on three fundamental medical
tasks: multimodal QA, text-based QA, and medical report generation. The results
show that Lingshu consistently outperforms the existing open-source multimodal
models on most tasks ...
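
The abstract does not specify how the RLVR reward is computed. As a rough
illustration only, a minimal sketch of a rule-based verifiable reward for
multiple-choice medical QA might look like the following; the function name,
the "Answer: X" output convention, and the A-E choice letters are hypothetical
assumptions, not details taken from the paper:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Rule-based reward for multiple-choice medical QA.

    Returns 1.0 when the extracted choice matches the gold answer and
    0.0 otherwise. Because the reward is computed by a fixed rule
    rather than a learned judge, it is "verifiable".
    """
    # Assume responses end with a line like "Answer: B" (a common
    # convention; Lingshu's actual output format may differ).
    matches = re.findall(r"[Aa]nswer\s*[:：]?\s*([A-E])\b", model_output)
    predicted = matches[-1].upper() if matches else None
    return 1.0 if predicted == gold_answer.upper() else 0.0

# A policy-optimization method (e.g., GRPO or PPO) would maximize this
# reward over responses sampled from the model.
print(verifiable_reward("The mass is hyperdense ... Answer: C", "C"))  # 1.0
print(verifiable_reward("Answer: A", "C"))                             # 0.0
```

The binary, rule-checked signal is what distinguishes RLVR from reward models
learned from preference data: correctness can be confirmed mechanically, which
avoids reward hacking against a learned judge.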