

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

October 17, 2025
Authors: Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
cs.AI

Abstract

Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs, including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems break this unified process apart: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset of over 5.6M samples that reformats diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning, which systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first unified medical multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
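To make the three OKA levels concrete, here is a minimal Python sketch of how such a pipeline could be wired together: unimodal records are reformatted into multimodal pairs (observation), training tasks are introduced in staged order (knowledge), and a single model is trained on both understanding and generation tasks (analysis). All identifiers (`MultimodalPair`, `build_pairs`, `CURRICULUM`, `train`) are illustrative assumptions, not the paper's actual code; see the linked repository for the real implementation.

```python
# Hypothetical sketch of an Observation-Knowledge-Analysis training flow.
# Names and structure are assumptions for illustration, not UniMedVL's API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalPair:
    """Observation level: a unimodal record reformatted into an image-text pair."""
    image_path: str
    text: str                        # report, caption, or QA annotation
    modality: str                    # e.g. "x-ray", "ct", "mri"
    mask_path: Optional[str] = None  # optional visual target (segmentation mask)


def build_pairs(raw_records: list[dict]) -> list[MultimodalPair]:
    """Reformat heterogeneous unimodal records into paired samples,
    mirroring how a UniMed-5M-style corpus could be assembled."""
    pairs = []
    for r in raw_records:
        if "image" in r and "report" in r:
            pairs.append(MultimodalPair(r["image"], r["report"],
                                        r.get("modality", "unknown")))
    return pairs


# Knowledge level: a progressive curriculum that introduces tasks in stages,
# from basic vision-language alignment to mixed understanding + generation.
CURRICULUM = [
    ("alignment", ["image-captioning"]),
    ("knowledge", ["vqa", "report-generation"]),
    ("unified",   ["vqa", "report-generation", "image-generation", "segmentation"]),
]


def train(model, pairs: list[MultimodalPair]) -> None:
    """Analysis level: one architecture trained on both understanding and
    generation tasks, stage by stage."""
    for stage_name, tasks in CURRICULUM:
        batch = list(pairs)  # in practice: filter/sample per task mix
        print(f"stage={stage_name}, tasks={tasks}, samples={len(batch)}")
        # model.fit(batch, tasks=tasks)  # placeholder for the actual training call
```

The staged task mix is the key design point the abstract emphasizes: keeping understanding and generation tasks in one training loop is what allows generation objectives to feed back into visual understanding features.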