EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
January 6, 2026
Authors: Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
cs.AI
Abstract
Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.