Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
February 24, 2026
Authors: Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei, Atena Farangi, AmirBahador Boroumand
cs.AI
Abstract
Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B--8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
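The evaluation relies on metrics that stay informative under class imbalance: per-feature sensitivity, specificity, and F1, aggregated as a macro average over the 13 features, plus the Matthews Correlation Coefficient. A minimal sketch of these computations from binary confusion counts is shown below; the function names and the example counts are illustrative, not taken from the paper's data.

```python
import math

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-feature metrics from a binary confusion matrix."""
    sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
    prec = tp / (tp + fp) if tp + fp else 0.0   # precision
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    # MCC uses all four cells, so it is robust to skewed positive rates.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sens, "specificity": spec, "f1": f1, "mcc": mcc}

def macro_f1(per_feature: list) -> float:
    """Macro-average F1: each feature weighted equally, regardless of
    how rare its positive class is."""
    return sum(m["f1"] for m in per_feature) / len(per_feature)

# Hypothetical counts for two of the 13 features (not from the paper):
features = [binary_metrics(90, 5, 10, 895), binary_metrics(20, 15, 30, 935)]
print(round(macro_f1(features), 3))  # → 0.697
```

Macro averaging is what lets a rare feature (e.g. an infrequent administrative request) pull the aggregate score down as much as a common physiological symptom would, which is why the paper reports macro-F1 and MCC rather than accuracy.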