
Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

June 25, 2025
Authors: Rian Touchent, Nathan Godey, Eric de la Clergerie
cs.AI

Abstract

We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, labeling each by type (review, study, clinical case, other) and domain (clinical, biomedical, other), and assigning an educational quality score. The educational quality score (1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs (over 450K of them high-quality and drawn from articles with commercial-use licenses), and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Our dataset therefore provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements: clinical upsampling boosts performance by ~5% on MMLU ProfMed, and educational quality filtering improves MedQA and MedMCQA by ~1%. Combining these techniques leads to faster convergence, reaching the same performance with a third of the training tokens and indicating potential for more efficient and effective biomedical pretraining strategies.
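The variants described above are built by filtering on the educational quality score and upsampling the clinical domain. A minimal sketch of that selection logic is shown below; the field names (`domain`, `edu_score`), the score threshold, and the upsampling factor are illustrative assumptions, not the paper's exact configuration.

```python
import random

# Hypothetical annotated paragraphs mirroring the dataset's metadata fields
# (type, domain, educational quality score 1-5); field names are assumptions.
paragraphs = [
    {"text": "A 54-year-old patient presented with ...", "type": "clinical_case",
     "domain": "clinical", "edu_score": 5},
    {"text": "We review recent advances in ...", "type": "review",
     "domain": "biomedical", "edu_score": 3},
    {"text": "Gene expression was measured by ...", "type": "study",
     "domain": "biomedical", "edu_score": 4},
]

def quality_filter(paras, min_score=4):
    """Keep paragraphs rated useful for college-level learning."""
    return [p for p in paras if p["edu_score"] >= min_score]

def upsample_domain(paras, domain="clinical", factor=3, seed=0):
    """Repeat paragraphs from a target domain to raise its share of the mix."""
    rng = random.Random(seed)
    extra = [p for p in paras if p["domain"] == domain] * (factor - 1)
    mix = paras + extra
    rng.shuffle(mix)
    return mix

subset = upsample_domain(quality_filter(paragraphs))
print(len(subset))  # prints 4: two high-quality paragraphs, clinical one repeated 3x
```

In a real pipeline the same two passes would run over the label-propagated PMC-OA metadata rather than in-memory dictionaries, but the ordering (filter first, then upsample) keeps the upsampled domain restricted to high-quality paragraphs.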