Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models
May 19, 2025
Authors: Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
cs.AI
Abstract
Homograph disambiguation remains a significant challenge in
grapheme-to-phoneme (G2P) conversion, especially for low-resource languages.
This challenge is twofold: (1) creating balanced and comprehensive homograph
datasets is labor-intensive and costly, and (2) specific disambiguation
strategies introduce additional latency, making them unsuitable for real-time
applications such as screen readers and other accessibility tools. In this
paper, we address both issues. First, we propose a semi-automated pipeline for
constructing homograph-focused datasets, introduce the HomoRich dataset
generated through this pipeline, and demonstrate its effectiveness by applying
it to enhance a state-of-the-art deep learning-based G2P system for Persian.
Second, we advocate for a paradigm shift: utilizing rich offline datasets to
inform the development of fast, rule-based methods suitable for
latency-sensitive accessibility applications like screen readers. To this end,
we improve one of the most well-known rule-based G2P systems, eSpeak, into a
fast homograph-aware version, HomoFast eSpeak. Our results show an approximate
30% improvement in homograph disambiguation accuracy for the deep
learning-based and eSpeak systems.