Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

May 19, 2025
Authors: Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
cs.AI

Abstract

Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift: utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications like screen readers. To this end, we improve one of the most well-known rule-based G2P systems, eSpeak, into a fast homograph-aware version, HomoFast eSpeak. Our results show an approximate 30% improvement in homograph disambiguation accuracy for both the deep learning-based and eSpeak systems.

