RuCCoD: Towards Automated ICD Coding in Russian

Abstract

This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding that consists of diagnosis fields from electronic health records (EHRs), annotated with over 10,000 entities and more than 1,500 unique ICD codes. The dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG. Additional experiments examine transfer learning across domains (from PubMed abstracts to medical diagnoses) and across terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Experiments on a carefully curated test set show that training on the automatically predicted codes yields significantly higher accuracy than training on codes manually assigned by physicians. We believe these findings offer valuable insight into the potential for automating clinical coding in resource-limited languages such as Russian, which could improve clinical efficiency and data accuracy in these settings.

