SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

June 26, 2025
作者: Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
cs.AI

Abstract

Multimodal in-context learning (ICL) remains underexplored despite its significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each comprising a multimodal query and multimodal in-context examples that serve as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs shows that most exhibit moderate to poor multimodal ICL ability on medical tasks. In open-ended evaluations, ICL yields an average improvement of only 8% over zero-shot on SMMILE and 9.4% on SMMILE++. We also observe a susceptibility to irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, example ordering exhibits a recency bias: placing the most relevant example last can improve performance by up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.
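To make the problem structure concrete, the sketch below shows one plausible way to represent an SMMILE-style problem (in-context question-image-answer triplets plus a held-out query) and flatten it into an interleaved multimodal prompt. The dataclasses, field names, and chat-message format are illustrative assumptions, not the benchmark's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One question-image-answer triplet; the image is referenced by path."""
    question: str
    image_path: str
    answer: str

@dataclass
class Problem:
    """An ICL problem: k multimodal demonstrations followed by one query."""
    examples: list[Triplet]  # in-context task demonstrations
    query: Triplet           # the query's answer is held out for scoring

def build_icl_messages(problem: Problem) -> list[dict]:
    """Flatten a problem into an interleaved, chat-style message list."""
    messages = []
    for ex in problem.examples:
        # Each demonstration is a user turn (image + question) followed
        # by the reference answer as an assistant turn.
        messages.append({
            "role": "user",
            "content": [
                {"type": "image", "path": ex.image_path},
                {"type": "text", "text": ex.question},
            ],
        })
        messages.append({"role": "assistant", "content": ex.answer})
    # The query comes last, with its answer withheld from the model.
    messages.append({
        "role": "user",
        "content": [
            {"type": "image", "path": problem.query.image_path},
            {"type": "text", "text": problem.query.question},
        ],
    })
    return messages
```

Under a representation like this, the perturbation studies described above reduce to simple list operations: permuting `problem.examples` probes ordering effects such as the reported recency bias, and appending an off-topic `Triplet` probes sensitivity to noisy or irrelevant demonstrations.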