An Empirical Study of Approaches to Universal Phone Recognition

Abstract

Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. High-performing English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective each contribute to multilingual PR. We present PhoneticXEUS, a model trained on large-scale multilingual data that achieves state-of-the-art performance on both multilingual speech (17.7% phonetic feature error rate, PFER) and accented English speech (10.6% PFER). Through controlled ablations, evaluated on 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of self-supervised learning (SSL) representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.