Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Abstract

Many spoken languages, including English, vary widely across dialects and accents, which makes accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, because the embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric that quantifies how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, but that embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and as a framework for evaluating disentanglement in speech generation.
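The abstract names the rules and the PSR metric without defining them, so the following is only a rough sketch of how such a pipeline might look, not the authors' implementation. It assumes ARPAbet-style phoneme strings with stress digits, implements flapping alone as a simple substitution rule, and reads PSR as the fraction of rule-changed positions at which the synthesized (observed) phoneme still matches the rule output. The function names `apply_flapping` and `phoneme_shift_rate`, the `DX` flap symbol, and the position-aligned comparison are all assumptions made for illustration.

```python
from typing import List

# ARPAbet vowel symbols without stress digits (assumed phoneme inventory).
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}


def is_vowel(phone: str) -> bool:
    # Strip an optional trailing stress digit, e.g. "AH0" -> "AH".
    return phone.rstrip("012") in VOWELS


def apply_flapping(phones: List[str]) -> List[str]:
    """Hypothetical American-English flapping rule.

    A /t/ between two vowels becomes a flap ("DX") when the
    following vowel is unstressed (stress digit 0).
    """
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] == "T"
                and is_vowel(phones[i - 1])
                and is_vowel(phones[i + 1])
                and phones[i + 1].endswith("0")):
            out[i] = "DX"
    return out


def phoneme_shift_rate(source: List[str],
                       rule_output: List[str],
                       observed: List[str]) -> float:
    """Assumed reading of PSR, not the paper's definition.

    Among positions the rule changed, return the fraction where the
    observed phoneme matches the rule output. All three sequences
    are assumed to be position-aligned and of equal length.
    """
    if not (len(source) == len(rule_output) == len(observed)):
        raise ValueError("sequences must be position-aligned")
    sites = [i for i, (s, r) in enumerate(zip(source, rule_output)) if s != r]
    if not sites:
        return 0.0  # the rule changed nothing, so nothing can be preserved
    kept = sum(1 for i in sites if observed[i] == rule_output[i])
    return kept / len(sites)


if __name__ == "__main__":
    # "butter daughter" as one aligned phoneme sequence.
    source = "B AH1 T ER0 D AO1 T ER0".split()
    flapped = apply_flapping(source)
    # Pretend the synthesized output kept one flap and reverted the other.
    observed = "B AH1 DX ER0 D AO1 T ER0".split()
    print(flapped)
    print(phoneme_shift_rate(source, flapped, observed))
```

Under this reading, a value near 1 would mean the embedding left the rule-based substitutions intact, and a value near 0 would mean it overrode them. Deletion rules such as non-rhotic /r/ dropping break the position alignment assumed here and would need an explicit alignment step first.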
