Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Abstract

Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
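
The two backbone modifications named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the sinusoidal form of the embedding, the assumption that the timestep lies in [0, 1], and the scaling constant are all illustrative choices that the abstract does not specify.

```python
import math

def attention_mask(n, causal):
    """Return an n x n boolean mask; mask[i][j] is True when position i may attend to j."""
    # Causal (autoregressive) masks only allow j <= i; removing the mask allows every pair.
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a continuous timestep t (assumed here to lie in [0, 1])."""
    half = dim // 2
    # Geometrically spaced frequencies, as in standard sinusoidal position embeddings.
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    # Hypothetical scaling so that nearby fractional timesteps map to distinct vectors.
    args = [1000.0 * t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]

causal = attention_mask(4, causal=True)          # lower-triangular visibility
bidirectional = attention_mask(4, causal=False)  # every token sees every token
emb = timestep_embedding(0.5)                    # vector that would be injected into the backbone
```

With the causal mask, the first token can attend only to itself, whereas the bidirectional mask lets every token see the full sequence, which is what a denoiser needs when corrupted tokens can appear at any position. The timestep vector gives the backbone a signal of how noisy its input is; how MeDiM actually injects it is not stated in the abstract.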
