Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models
May 19, 2025
Authors: Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
cs.AI
Abstract
Homograph disambiguation remains a significant challenge in
grapheme-to-phoneme (G2P) conversion, especially for low-resource languages.
This challenge is twofold: (1) creating balanced and comprehensive homograph
datasets is labor-intensive and costly, and (2) specific disambiguation
strategies introduce additional latency, making them unsuitable for real-time
applications such as screen readers and other accessibility tools. In this
paper, we address both issues. First, we propose a semi-automated pipeline for
constructing homograph-focused datasets, introduce the HomoRich dataset
generated through this pipeline, and demonstrate its effectiveness by applying
it to enhance a state-of-the-art deep learning-based G2P system for Persian.
Second, we advocate for a paradigm shift: using rich offline datasets to
inform the development of fast, rule-based methods suitable for
latency-sensitive accessibility applications like screen readers. To this end,
we extend one of the most well-known rule-based G2P systems, eSpeak, into a
fast homograph-aware version, HomoFast eSpeak. Our results show an approximate
30% improvement in homograph disambiguation accuracy for the deep
learning-based and eSpeak systems.
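
To make the second idea concrete, the following is a minimal sketch of how statistics gathered offline from a labeled homograph dataset could drive a fast, lookup-based disambiguator at runtime. It is an illustration under stated assumptions, not the HomoFast eSpeak implementation: the function names `build_rules` and `disambiguate`, the context-word voting scheme, and the toy English examples are all hypothetical and do not come from the paper or from HomoRich.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (sentence, homograph, pronunciation label).
# These English examples are illustrative only; they are not from HomoRich.
TRAIN = [
    ("he plays bass in a band", "bass", "b eɪ s"),
    ("the bass guitar is loud", "bass", "b eɪ s"),
    ("we caught a bass in the lake", "bass", "b æ s"),
    ("bass fishing on the river", "bass", "b æ s"),
]


def build_rules(examples):
    """Offline step: count how often each context word co-occurs
    with each pronunciation of a homograph."""
    counts = defaultdict(lambda: defaultdict(Counter))
    priors = defaultdict(Counter)
    for sentence, word, pron in examples:
        priors[word][pron] += 1
        for tok in sentence.split():
            if tok != word:
                counts[word][tok][pron] += 1
    return counts, priors


def disambiguate(sentence, word, counts, priors):
    """Runtime step: sum the votes of the context words using plain
    dictionary lookups, falling back to the most frequent pronunciation."""
    votes = Counter()
    word_rules = counts.get(word, {})
    for tok in sentence.split():
        votes.update(word_rules.get(tok, {}))
    if votes:
        return votes.most_common(1)[0][0]
    prior = priors.get(word)
    if prior:
        return prior.most_common(1)[0][0]
    return None  # the word is not a known homograph


counts, priors = build_rules(TRAIN)
print(disambiguate("she bought a bass guitar", "bass", counts, priors))
print(disambiguate("a bass swam in the lake", "bass", counts, priors))
```

The point of the design is that all of the expensive work happens once, offline; the runtime path is a handful of dictionary lookups per homograph, which is the kind of cost profile a latency-sensitive screen reader can tolerate.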