Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Abstract

As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, little is currently known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions, created via an automated pipeline, for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. We also release a new dataset of the extracted UK Government public health guidance documents used as source text for PubHealthBench. Assessing 24 LLMs on PubHealthBench, we find that the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% in the MCQA setup, and outperform humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Therefore, whilst there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free-form responses on public health topics.