

OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

August 3, 2025
作者: Maziyar Panahi
cs.AI

Abstract

Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.
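The parameter-efficiency claim above (LoRA updating less than 1.5% of model weights) can be illustrated with a minimal sketch. The class below is a generic low-rank adaptation layer, not the authors' code, and the model-size figures (~184M total parameters, 12 layers, hidden size 768, rank-8 adapters on the four attention projections) are assumptions roughly matching a DeBERTa-v3-base-sized backbone:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen weight W plus a trainable low-rank
    update scale * (B @ A). Only A and B would receive gradients during
    fine-tuning, so the trainable parameter count is r*(d_in + d_out)
    instead of d_in*d_out."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))     # frozen, "pretrained"
        self.A = 0.01 * rng.standard_normal((r, d_in))  # trainable
        self.B = np.zeros((d_out, r))                   # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Because B is zero-initialised, the adapted layer starts out
        # identical to the frozen one; training only moves the B @ A path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

# Back-of-the-envelope budget (assumed figures, for illustration):
# rank-8 adapters on the q, k, v, o projections of each of 12 layers.
lora_params = 12 * 4 * (8 * 768 + 768 * 8)  # A and B per projection
print(f"trainable share of a ~184M-param model: {lora_params / 184e6:.3%}")
```

With these assumptions the adapters account for well under 1% of the model, comfortably within the <1.5% budget the abstract reports; the exact share in OpenMed NER depends on which modules receive adapters and on the classifier head.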