Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely owing to large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited because of inherent discrepancies between the data and tasks of medical scenarios and those of the general domain. Concretely, existing medical MLLMs face three critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored to complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data, and (2) synthesizes accurate medical captions, visual question answering (VQA) samples, and reasoning samples. The result is a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM, Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. We also conduct a preliminary exploration of applying the reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms existing open-source multimodal models on most tasks ...
