OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

Abstract

Named-entity recognition (NER) is fundamental to extracting structured information from the more than 80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state of the art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. These results come at low cost: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), and it produces permissively licensed, open-source checkpoints designed to help practitioners comply with emerging data protection and AI regulations, such as the EU AI Act.
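To make the "less than 1.5% of model parameters" figure concrete, the following is a minimal sketch of how the trainable fraction under LoRA can be estimated. It relies only on the standard LoRA construction, in which a frozen weight of shape d_out x d_in is augmented with two trainable low-rank factors of shapes d_out x r and r x d_in. Every number in the example (base parameter count, hidden size, layer count, rank, and which projections are adapted) is a hypothetical illustration, not the configuration used for OpenMed NER, and the sketch ignores the small task-specific classification head.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is factored as B @ A with B of shape (d_out, rank) and
    A of shape (rank, d_in), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def trainable_fraction(base_params: int, hidden: int, layers: int,
                       adapted_per_layer: int, rank: int) -> float:
    """Share of all parameters that are updated when only LoRA factors train.

    Assumes every adapted matrix is square (hidden x hidden), as is typical
    for attention projections; this is an assumption of the sketch.
    """
    added = layers * adapted_per_layer * lora_added_params(hidden, hidden, rank)
    return added / (base_params + added)


# Hypothetical BERT-base-sized backbone, for illustration only:
# ~110M frozen parameters, 12 layers, hidden size 768,
# LoRA rank 16 applied to two projections per layer.
frac = trainable_fraction(
    base_params=110_000_000,
    hidden=768,
    layers=12,
    adapted_per_layer=2,
    rank=16,
)
print(f"Trainable fraction: {frac:.2%}")
```

Under these assumed values the adapters add 589,824 parameters, roughly half a percent of the total, which shows how a budget well under 1.5% is plausible even at moderate ranks. Raising the rank or adapting more matrices per layer scales the fraction linearly.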
