Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Abstract

Recent advances in generative medical models are confined to modality-specific scenarios, which hinders the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. We introduce two key designs: (1) removing the causal attention mask to provide bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further improve downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
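
The two backbone modifications named in the abstract can be illustrated with a minimal sketch. It is not MeDiM's implementation: the abstract does not specify the embedding form, so the sinusoidal embedding below is an assumption, and the function names (`causal_mask`, `bidirectional_mask`, `timestep_embedding`) and the `max_period` parameter are hypothetical. The sketch only contrasts the causal mask of an autoregressive model with the full mask that gives every token bidirectional context, and shows one common way to turn a continuous timestep into a vector.

```python
import math


def causal_mask(n):
    """Autoregressive mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Mask with the causal constraint removed: every position attends to all positions."""
    return [[True] * n for _ in range(n)]


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a continuous timestep t.

    Assumed form for illustration only; the source does not state how
    MeDiM embeds timesteps. `dim` is expected to be even.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

In a denoiser built this way, the embedding would typically be added to, or used to modulate, the token representations so the network knows how heavily the input has been corrupted. How MeDiM injects it is not stated in the abstract.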
