Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
June 25, 2025
Authors: Rian Touchent, Nathan Godey, Eric de la Clergerie
cs.AI
Abstract
We introduce Biomed-Enriched, a biomedical text dataset constructed from
PubMed via a two-stage annotation process. In the first stage, a large language
model annotates 400K paragraphs from PubMed scientific articles, labeling
each paragraph's type (review, study, clinical case, other) and domain
(clinical, biomedical, other) and assigning an educational quality score.
The educational quality score (rated 1 to 5) estimates how useful a
paragraph is for college-level learning.
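As a minimal illustrative sketch (the abstract does not give the paper's
actual annotation schema, so the field names below are assumptions), each
annotated paragraph can be thought of as a record like:

    from dataclasses import dataclass
    from typing import Literal

    # Hypothetical record for one annotated paragraph; the label sets mirror
    # the abstract, but the field names themselves are assumed.
    @dataclass
    class ParagraphAnnotation:
        text: str
        type: Literal["review", "study", "clinical case", "other"]
        domain: Literal["clinical", "biomedical", "other"]
        educational_quality: int  # 1 to 5; usefulness for college-level learning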
These annotations are then used to fine-tune a small language model, which
propagates the labels across the full PMC-OA corpus. The resulting metadata
allows us to extract refined subsets, including 2M clinical case paragraphs
with over 450K high-quality ones from articles with commercial-use licenses,
and to construct several variants via quality filtering and domain upsampling.
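As a rough sketch of how such variants could be built from the propagated
metadata (the quality threshold and upsampling factor below are assumptions,
not the paper's settings):

    # Sketch: quality filtering plus clinical-domain upsampling.
    # min_quality and clinical_upsample are illustrative values.
    def build_variant(paragraphs, min_quality=4, clinical_upsample=2):
        subset = []
        for p in paragraphs:
            if p["educational_quality"] < min_quality:
                continue  # quality filtering: drop low-scoring paragraphs
            # domain upsampling: repeat clinical paragraphs in the mix
            copies = clinical_upsample if p["domain"] == "clinical" else 1
            subset.extend([p] * copies)
        return subset

In practice the upsampling could equally be implemented as sampling weights
rather than literal repetition.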
Clinical text is typically difficult to access due to privacy constraints, as
hospital records cannot be publicly shared. Hence, our dataset provides an
alternative large-scale, openly available collection of clinical cases from
PubMed, making it a valuable resource for biomedical and clinical NLP.
Preliminary continual-pretraining experiments with OLMo2 suggest these curated
subsets enable targeted improvements, with clinical upsampling boosting
performance by ~5% on MMLU ProfMed and educational quality filtering improving
MedQA and MedMCQA by ~1%. Combining these techniques led to faster
convergence, reaching the same performance with a third of the training tokens,
indicating potential for more efficient and effective biomedical pretraining
strategies.