Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited because the data and tasks in medical scenarios differ inherently from those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored to complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data, and (2) synthesizes accurate medical captions, visual question answering (VQA) samples, and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM, Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. We also conduct a preliminary exploration of applying the reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
