
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

June 26, 2025
Authors: Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
cs.AI

Abstract

Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility to irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, example ordering exhibits a recency bias: placing the most relevant example last can improve performance by up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.
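For concreteness, here is a minimal sketch of how a multimodal ICL problem of the kind described above (a multimodal query preceded by question-image-answer demonstrations) might be assembled into a prompt. It uses the widely adopted OpenAI-style chat message schema; the data class, helper names, and message layout are illustrative assumptions, not the SMMILE harness itself.

```python
# Sketch of multimodal ICL prompt construction as described in the abstract:
# each problem pairs a multimodal query with multimodal in-context examples
# (question-image-answer triplets). Message format follows the common
# OpenAI-style chat schema; all names here are hypothetical, not SMMILE code.
import base64
from dataclasses import dataclass

@dataclass
class Triplet:
    question: str
    image_path: str
    answer: str = ""  # left empty for the final query the model must answer

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_icl_messages(examples: list[Triplet], query: Triplet) -> list[dict]:
    """Interleave in-context demonstrations before the final multimodal query.

    Ordering matters: the paper reports a recency bias, so placing the most
    relevant demonstration *last* in `examples` can markedly help.
    """
    messages = []
    for ex in examples:  # few-shot demonstrations (question + image + answer)
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": ex.question},
                                     image_part(ex.image_path)]})
        messages.append({"role": "assistant", "content": ex.answer})
    # Held-out query; a zero-shot baseline simply passes examples=[].
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": query.question},
                                 image_part(query.image_path)]})
    return messages
```

Under this framing, the paper's ablations map onto simple list operations: the zero-shot baseline corresponds to an empty `examples` list, the noise-sensitivity test to swapping one triplet for an irrelevant one, and the recency-bias experiment to permuting `examples` so the most relevant triplet comes last.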