UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
October 17, 2025
Authors: Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
cs.AI
Abstract
Medical diagnostic applications require models that can process multimodal
medical inputs (images, patient histories, lab results) and generate diverse
outputs including both textual reports and visual content (annotations,
segmentation masks, and images). Despite this need, existing medical AI systems
disrupt this unified process: medical image understanding models interpret
images but cannot generate visual outputs, while medical image generation
models synthesize images but cannot provide textual explanations. This leads to
gaps in data representation, feature integration, and task-level multimodal
capabilities. To this end, we propose a multi-level framework that draws
inspiration from diagnostic workflows through the
Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation
level, we construct UniMed-5M, a dataset of over 5.6M samples in which
diverse unimodal data are reformatted into multimodal pairs for foundational
observation. At the knowledge level, we propose Progressive Curriculum Learning
that systematically introduces medical multimodal knowledge. At the analysis
level, we introduce UniMedVL, the first unified medical multimodal model that
performs both image understanding and image generation within a single
architecture. UniMedVL achieves superior performance on five medical
image understanding benchmarks, while matching specialized models in generation
quality across eight medical imaging modalities. Crucially, our unified
architecture enables bidirectional knowledge sharing: generation tasks enhance
visual understanding features, demonstrating that integrating traditionally
separate capabilities within a single medical framework unlocks improvements
across diverse medical vision-language tasks. Code is available at
https://github.com/uni-medical/UniMedVL.
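
To make the observation-level idea concrete, below is a minimal Python sketch of how a unimodal record might be reformatted into multimodal (image, text) pairs. The MultimodalPair structure, field names, and prompt templates are illustrative assumptions, not the actual UniMed-5M pipeline.

```python
# Hypothetical sketch: deriving multimodal pairs from one unimodal record.
from dataclasses import dataclass

@dataclass
class MultimodalPair:
    image_path: str
    text: str
    task: str  # "understanding" or "generation"

def reformat_segmentation_record(image_path: str, mask_path: str,
                                 organ: str) -> list[MultimodalPair]:
    """Derive an understanding pair and a generation pair
    from a single (image, mask) segmentation record."""
    return [
        # Understanding: image in, textual instruction/answer out.
        MultimodalPair(image_path, f"Segment the {organ} in this scan.",
                       task="understanding"),
        # Generation: text in, visual target out (mask as the image side).
        MultimodalPair(mask_path, f"A segmentation mask of the {organ}.",
                       task="generation"),
    ]

pairs = reformat_segmentation_record("ct_001.png", "ct_001_mask.png", "liver")
for p in pairs:
    print(p.task, "->", p.text)
```

The point of the sketch is that one annotated record can feed both task families, which is how unimodal corpora can be recast as paired supervision.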
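The knowledge-level Progressive Curriculum Learning can be pictured as a staged data-mixture schedule. The stage names, task pools, sampling weights, step counts, and the train_step/sample_batch callables below are all hypothetical; the sketch only illustrates the general pattern of progressively shifting the task mixture during training.

```python
# Minimal sketch of a staged curriculum over task mixtures (assumed stages).
import random

CURRICULUM = [
    # (stage name, sampling weights over task pools, number of steps)
    ("foundational-observation", {"captioning": 0.8, "vqa": 0.2}, 100),
    ("knowledge-injection",      {"captioning": 0.4, "vqa": 0.4, "report": 0.2}, 100),
    ("unified-analysis",         {"vqa": 0.3, "report": 0.3, "generation": 0.4}, 100),
]

def sample_task(weights: dict[str, float]) -> str:
    tasks, probs = zip(*weights.items())
    return random.choices(tasks, weights=probs, k=1)[0]

def run_curriculum(train_step, sample_batch):
    for stage, weights, steps in CURRICULUM:
        for _ in range(steps):
            task = sample_task(weights)
            batch = sample_batch(task)   # draw a batch from that task's pool
            train_step(batch)            # one optimization step
        print(f"finished stage: {stage}")

# Toy usage with no-op steps:
run_curriculum(train_step=lambda batch: None,
               sample_batch=lambda task: {"task": task})
```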
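At the analysis level, a unified model means one backbone serving both task families, which is what enables the bidirectional knowledge sharing the abstract highlights. The toy PyTorch module below shows the general shape of such an architecture (a shared encoder with separate text and image-token heads); dimensions, module choices, and the mode-routing scheme are illustrative assumptions, not UniMedVL's actual design.

```python
# Toy unified backbone: shared features, two task-specific output heads.
import torch
import torch.nn as nn

class UnifiedMedModel(nn.Module):
    def __init__(self, dim=512, text_vocab=32000, image_vocab=8192):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        self.text_head = nn.Linear(dim, text_vocab)    # report / answer tokens
        self.image_head = nn.Linear(dim, image_vocab)  # discrete image tokens

    def forward(self, tokens: torch.Tensor, mode: str) -> torch.Tensor:
        h = self.backbone(tokens)  # shared features for both task families
        return self.text_head(h) if mode == "understanding" else self.image_head(h)

model = UnifiedMedModel()
x = torch.randn(2, 16, 512)              # stand-in for embedded multimodal inputs
report_logits = model(x, "understanding")
image_logits = model(x, "generation")
print(report_logits.shape, image_logits.shape)
```

Because both heads read from the same backbone, gradients from generation tasks update the same features used for understanding, which is one plausible mechanism behind the cross-task gains the abstract reports.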