M3Retrieve: A Benchmark for Medical Multimodal Retrieval

Abstract

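As Retrieval-Augmented Generation (RAG) becomes more widely used, strong retrieval models matter more than ever. In healthcare, multimodal retrieval models that combine text and image information show clear advantages in many downstream tasks, such as question answering, cross-modal retrieval, and multimodal summarization, because medical data often contains both forms. However, there is still no standard benchmark for evaluating how these models perform in medical settings. To fill this gap, we introduce M3Retrieve, a multimodal medical retrieval benchmark. M3Retrieve covers 5 domains, 16 medical specialties, and 4 distinct tasks, and contains more than 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to examine the challenges specific to different medical specialties and their effect on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, encourage model innovation, and accelerate research on more capable and reliable multimodal medical retrieval systems. The dataset and baseline code are available on GitHub: https://github.com/AkashGhosh/M3Retrieve.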

English

With the increasing use of Retrieval-Augmented Generation (RAG), strong retrieval models have become more important than ever. In healthcare, multimodal retrieval models that combine information from both text and images offer major advantages for many downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, since medical data often includes both formats. However, there is currently no standard benchmark to evaluate how well these models perform in medical settings. To address this gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark. M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to explore the challenges specific to different medical specialties and to understand their impact on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, foster model innovation, and accelerate research toward building more capable and reliable multimodal retrieval systems for medical applications. The dataset and the baseline code are available on GitHub at https://github.com/AkashGhosh/M3Retrieve.