ChatPaper.ai


MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

April 7, 2026
Authors: Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang
cs.AI

Abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR (SCImago Journal Rank), enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
PDF · April 22, 2026