

OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

August 3, 2025
Author: Maziyar Panahi
cs.AI

Abstract

Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.
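The parameter-efficient LoRA step described in the abstract can be illustrated in plain PyTorch. This is a minimal sketch, not the authors' implementation: the base weight matrix is frozen and only two small low-rank factors are trained. The layer size (768) and rank (r=8) are assumed for illustration; in practice adapters are attached to selected attention matrices of a DeBERTa-v3, PubMedBERT, or BioELECTRA backbone via a library such as Hugging Face PEFT.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r x in) and B is (out x r)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights; only A and B receive gradients.
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init => no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Hypothetical hidden size for a BERT-class encoder layer.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total} ({trainable / total:.2%})")
```

For a single square layer the trainable share is about 2%; across a full model, where adapters are added only to a few projection matrices, the fraction drops well below the <1.5% reported in the abstract.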