EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
January 6, 2026
Authors: Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
cs.AI
Abstract
Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
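To make the reported comparison of direct versus Chain-of-Thought prompting concrete, the sketch below shows a minimal evaluation loop of the kind such a benchmark study might use. It is an illustration only: the item schema (question, options, answer fields), the prompt wording, the answer-letter extraction, and the pluggable generate_fn are assumptions, since the abstract does not specify EpiQAL's data format or evaluation protocol.

```python
# Minimal sketch, assuming multiple-choice items with "question", "options",
# and "answer" fields; EpiQAL's real schema and prompts may differ.
import re
from typing import Callable, Dict, List


def build_prompt(item: Dict, use_cot: bool) -> str:
    """Format one QA item as a direct or Chain-of-Thought prompt."""
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    instruction = (
        "Think step by step, citing the evidence, then give the final option letter."
        if use_cot
        else "Answer with the single option letter only."
    )
    return f"{item['question']}\n{options}\n{instruction}"


def extract_choice(response: str) -> str:
    """Pull the last standalone option letter (A-D) from the model's output."""
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else ""


def evaluate(items: List[Dict], generate_fn: Callable[[str], str], use_cot: bool) -> float:
    """Accuracy of a text-generation callable on a list of items."""
    correct = sum(
        extract_choice(generate_fn(build_prompt(item, use_cot))) == item["answer"]
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Toy item and a dummy model stand-in; replace generate_fn with a real LLM call.
    demo = [{
        "question": "Which quantity describes the average number of secondary "
                    "infections caused by one case in a fully susceptible population?",
        "options": {"A": "Case fatality ratio", "B": "Basic reproduction number",
                    "C": "Incubation period", "D": "Attack rate"},
        "answer": "B",
    }]
    dummy = lambda prompt: "The evidence points to the basic reproduction number. Answer: B"
    print("direct:", evaluate(demo, dummy, use_cot=False))
    print("CoT:   ", evaluate(demo, dummy, use_cot=True))
```

Running both conditions over the same items, as above, is one simple way to attribute accuracy differences to the prompting strategy rather than to the item sample.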