

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

January 30, 2025
作者: Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
cs.AI

Abstract

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
