Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Abstract

As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, little is currently known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces PubHealthBench, a new benchmark of over 8000 questions, created via an automated pipeline, for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. We also release a new dataset of the extracted UK Government public health guidance documents used as source text for PubHealthBench. Assessing 24 LLMs on PubHealthBench, we find that the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% in the MCQA setup, and outperform humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Therefore, whilst there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free-form responses on public health topics.