Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
October 7, 2025
Authors: Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, Yuyin Zhou
cs.AI
Abstract
Recent advances in generative medical models are constrained by
modality-specific scenarios that hinder the integration of complementary
evidence from imaging, pathology, and clinical notes. This fragmentation limits
their evolution into foundation models that can learn and reason across the
full spectrum of biomedical data. We propose MeDiM, the first medical discrete
diffusion model that learns shared distributions across modalities without
modality-specific components. MeDiM unifies multiple generative tasks:
translating between images and text, and jointly producing image-report pairs
across domains in response to prompts. Built on a discrete diffusion framework,
MeDiM bridges vision and language representations through a shared
probabilistic space. To enable unified and flexible medical generation, we
employ a multimodal large language model (MLLM) as the diffusion backbone,
leveraging its prior knowledge and cross-modal reasoning. Two key designs are
introduced: (1) removing the causal attention mask for bidirectional context,
and (2) injecting continuous timestep embeddings for diffusion awareness.
Experiments demonstrate high-fidelity medical generation (FID 16.60 on
MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR
0.2650 and 0.2580). Jointly generated image-report pairs further enhance
downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3,
+4.80% METEOR), showing that MeDiM supports
coherent and clinically grounded multimodal outputs.
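
The two backbone changes named in the abstract (dropping the causal attention mask and injecting a continuous timestep embedding) can be illustrated with a small, self-contained sketch. This is not the authors' implementation. The abstract does not specify the corruption process or the form of the embedding, so the absorbing-state ("mask token") forward process, the sinusoidal embedding, the `MASK_ID` value, the token ids, and all function names below are assumptions made purely for illustration.

```python
import math
import random

MASK_ID = -1  # hypothetical id for an absorbing [MASK] token (assumption)


def corrupt(tokens, t, rng):
    """Assumed forward process: replace each token by MASK_ID independently
    with probability t, where t in [0, 1] is the continuous timestep."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def attention_mask(n, causal):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 if position i may attend
    to position j. causal=False gives full bidirectional context."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a continuous timestep t (an assumed choice,
    common in diffusion models). dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * k / half) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


# Image and text tokens share one sequence, so no modality-specific branch
# is needed. The ids are made up for the example.
rng = random.Random(0)
image_tokens = [101, 102, 103, 104]
text_tokens = [7, 8, 9]
sequence = image_tokens + text_tokens

noisy = corrupt(sequence, t=0.5, rng=rng)          # partially masked sequence
full_context = attention_mask(len(sequence), causal=False)
t_emb = timestep_embedding(0.5, dim=8)             # would condition the backbone
```

With `causal=False` every position can attend to every other, which is what lets a masked token be predicted from context on both sides. The timestep embedding tells the model how heavily the input was corrupted.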