Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

Abstract

Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. The challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time accessibility applications such as screen readers. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by using it to improve a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate a paradigm shift: using rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications. To this end, we extend eSpeak, one of the most well-known rule-based G2P systems, into a fast homograph-aware version, HomoFast eSpeak. Our results show an improvement of approximately 30% in homograph disambiguation accuracy for both the deep learning-based system and eSpeak.
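
The idea of letting an offline dataset inform a fast rule-based disambiguator can be illustrated with a small sketch. This is not the algorithm used in HomoFast eSpeak, which the abstract does not describe. It is a minimal illustration under stated assumptions: the training sentences, the stopword list, the English homograph "bass", and the function names `build_rules`, `default_pronunciations`, and `disambiguate` are all hypothetical. Context-word statistics are collected once offline, so the runtime step reduces to a few dictionary lookups.

```python
from collections import Counter, defaultdict

# Toy, hand-written examples (hypothetical; not taken from HomoRich).
# Each item: (sentence tokens, homograph, pronunciation label).
TRAIN = [
    ("he plays the bass guitar".split(), "bass", "beɪs"),
    ("the bass line was loud".split(), "bass", "beɪs"),
    ("a bass guitar solo".split(), "bass", "beɪs"),
    ("she caught a bass in the lake".split(), "bass", "bæs"),
    ("grilled bass with lemon".split(), "bass", "bæs"),
]

# Assumed stopword list: function words carry little disambiguating signal.
STOPWORDS = {"the", "a", "in", "he", "she", "was", "with"}


def build_rules(examples):
    """Offline step: count context-word / pronunciation co-occurrences."""
    rules = defaultdict(lambda: defaultdict(Counter))
    for tokens, homograph, pron in examples:
        for tok in tokens:
            if tok != homograph and tok not in STOPWORDS:
                rules[homograph][tok][pron] += 1
    return rules


def default_pronunciations(examples):
    """Offline step: most frequent pronunciation per homograph (fallback)."""
    freq = defaultdict(Counter)
    for _, homograph, pron in examples:
        freq[homograph][pron] += 1
    return {h: c.most_common(1)[0][0] for h, c in freq.items()}


def disambiguate(tokens, homograph, rules, defaults):
    """Runtime step: sum the votes of known context words, else fall back."""
    votes = Counter()
    for tok in tokens:
        if tok != homograph:
            votes.update(rules[homograph].get(tok, {}))
    if votes:
        return votes.most_common(1)[0][0]
    return defaults[homograph]


RULES = build_rules(TRAIN)
DEFAULTS = default_pronunciations(TRAIN)

print(disambiguate("loud bass guitar".split(), "bass", RULES, DEFAULTS))
print(disambiguate("bass from the lake".split(), "bass", RULES, DEFAULTS))
```

The runtime cost is one lookup per context token, which is the kind of budget a screen reader can afford. A real system would need far more data and a principled way to handle ties and unseen contexts.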