MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

August 6, 2024
Authors: Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou
cs.AI

Abstract

This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretrained on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal large language models and other representative state-of-the-art approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.
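To make the annotation structure concrete, below is a minimal Python sketch of the image-ROI-description triplets described in the abstract, pairing global text (modality, disease/lesion type, description) with local ROI annotations (bounding box, segmentation mask). All class and field names (e.g. `MultigranularSample`, `ROIAnnotation`) are illustrative assumptions for exposition, not the dataset's released schema.

```python
# Hypothetical sketch of a multigranular image-ROI-description triplet.
# Names and fields are assumptions, not the official MedTrinity-25M format.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ROIAnnotation:
    """Local annotation for one region of interest (ROI)."""
    bounding_box: Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max) in pixels
    segmentation_mask_path: Optional[str] = None   # path to a binary mask, if available


@dataclass
class MultigranularSample:
    """One image with global text plus per-ROI local descriptions."""
    image_path: str
    modality: str                       # e.g. "MRI", "CT", "pathology"
    disease_or_lesion: str              # global label, e.g. "glioma"
    global_description: str             # region-specific text and inter-regional relationships
    rois: List[ROIAnnotation] = field(default_factory=list)
    local_descriptions: List[str] = field(default_factory=list)  # one description per ROI


# Example usage with made-up values:
sample = MultigranularSample(
    image_path="images/case_0001.png",
    modality="MRI",
    disease_or_lesion="glioma",
    global_description="Axial T2 MRI showing a hyperintense mass in the left frontal lobe ...",
    rois=[ROIAnnotation(bounding_box=(120, 80, 210, 160))],
    local_descriptions=["The boxed region contains an irregular hyperintense lesion ..."],
)
print(sample.modality, len(sample.rois))
```

Structuring each sample this way reflects the pipeline described above: expert models supply the ROI geometry, while retrieval-augmented generation with a multimodal large language model supplies the global and per-ROI text.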
