Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

January 20, 2026
Authors: Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
cs.AI

Abstract

Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and provide a framework for evaluating disentanglement in speech generation.
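
As a rough illustration of the kind of rules the abstract names, the sketch below applies a flapping rule and a non-rhoticity rule to ARPAbet-style phoneme lists, then computes a simple shift rate over rule-affected sites. The abstract does not give the formula for PSR or the authors' rule implementation, so everything here is an assumption for illustration: the phoneme inventory, the function names (`apply_flapping`, `apply_non_rhoticity`, `shift_rate`), and the reading of the metric as "the fraction of rule-affected sites where the synthesized phoneme matches the rule's target".

```python
# Hypothetical, simplified vowel inventory (ARPAbet-style symbols, no stress marks).
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "OW", "UH", "UW"}


def apply_flapping(phones):
    """American-style flapping: T between two vowels becomes a flap (DX)."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if phones[i] == "T" and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS:
            out[i] = "DX"
    return out


def apply_non_rhoticity(phones):
    """British-style non-rhoticity: drop R unless a vowel follows it."""
    out = []
    for i, p in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if p == "R" and nxt not in VOWELS:
            continue  # R before a consonant or at the end of the word is dropped
        out.append(p)
    return out


def shift_rate(sites):
    """Assumed stand-in for a PSR-like score, not the paper's definition.

    `sites` holds one (source, target, observed) triple per rule-affected
    position: the phoneme before the rule, the phoneme the rule asks for,
    and the phoneme recognized in the synthesized audio. The score is the
    fraction of sites where the observed phoneme equals the rule's target.
    """
    if not sites:
        return 0.0
    shifted = sum(1 for _, target, observed in sites if observed == target)
    return shifted / len(sites)


flapped = apply_flapping(["W", "AO", "T", "ER"])        # "water"
non_rhotic = apply_non_rhoticity(["K", "AA", "R"])      # "car"
rate = shift_rate([("T", "DX", "DX"), ("T", "DX", "T")])
```

Under this reading, a score near 1 would mean the synthesized speech keeps the rule-based transformations, while a lower score would suggest the speaker embedding attenuates or overrides them.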