MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

August 6, 2024
Authors: Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou
cs.AI

Abstract

This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. After pretraining on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal large language models and other representative state-of-the-art approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.
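
To make the described pipeline concrete, below is a minimal Python sketch of the image-ROI-description triplet structure and the annotation flow outlined in the abstract (expert-model grounding, knowledge-base retrieval, ROI-guided generation). All names here (`ROI`, `Triplet`, `annotate`, and the `grounder`/`kb`/`mllm` objects) are illustrative assumptions for exposition, not the authors' released code.

```python
# Hypothetical sketch of the MedTrinity-25M annotation pipeline as described
# in the abstract. Class and method names are assumptions, not the real API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ROI:
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) bounding box
    mask_path: str                   # path to the segmentation mask


@dataclass
class Triplet:
    image_path: str        # source medical image
    rois: List[ROI]        # regions of interest found by expert models
    description: str       # multigranular textual annotation


def annotate(image_path: str, grounder, kb, mllm) -> Triplet:
    """Sketch of one pipeline pass: ground abnormal regions with a
    domain-specific expert model, retrieve knowledge-base context, then
    prompt a multimodal LLM for a multigranular description."""
    # Step 1: expert-model grounding identifies ROIs tied to abnormal regions.
    rois = grounder.locate_abnormalities(image_path)

    # Step 2: retrieval-augmented generation - pull relevant domain knowledge.
    context = kb.retrieve(image_path, rois)

    # Step 3: ROI-guided generation of global + region-level text.
    prompt = (
        "Describe the modality, disease/lesion type, each region of "
        f"interest, and inter-regional relationships. Context: {context}"
    )
    description = mllm.generate(image_path, rois, prompt)

    return Triplet(image_path=image_path, rois=rois, description=description)
```

Note that no paired text description is required as input: the text is produced by the ROI-guided, retrieval-augmented generation step, which is what lets the pipeline scale to sources without existing captions.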
