Optimizing Multilingual Text-To-Speech with Accents & Emotions
June 19, 2025
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
cs.AI
Abstract
State-of-the-art text-to-speech (TTS) systems achieve high naturalness in
monolingual settings, but synthesizing speech with correct multilingual accents
(especially for Indic languages) and context-appropriate emotions remains
difficult owing to the cultural nuances missing from current frameworks. This
paper introduces a novel TTS architecture that integrates accent preservation
with multi-scale emotion modelling, tuned in particular for Hindi and Indian
English accents. Our approach extends the Parler-TTS model with a
language-specific phoneme-alignment hybrid encoder-decoder architecture and
culture-sensitive emotion embedding layers trained on native-speaker corpora,
combined with dynamic accent code-switching via residual vector quantization.
Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word
Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy
among native listeners, surpassing the METTS and VECL-TTS baselines. The
system's novelty lies in real-time code-mixing: it generates utterances such as
"Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while
preserving emotional consistency. A subjective evaluation with 200 users
reported a mean opinion score (MOS) of 4.2/5 for cultural correctness,
significantly outperforming existing multilingual systems (p<0.01). This
research makes cross-lingual synthesis more practical by demonstrating scalable
accent-emotion disentanglement, with direct applications in South Asian EdTech
and accessibility software.
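The abstract names residual vector quantization (RVQ) as the mechanism behind dynamic accent code-switching. The paper's actual quantizer is not reproduced here; the following is a minimal, illustrative NumPy sketch of the generic RVQ idea only — successive codebooks quantize the residual left by earlier stages — with hypothetical names (`rvq_encode`, random toy codebooks) not taken from the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook stage quantizes
    the residual left over by the previous stages."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # pick the nearest codeword for each frame's current residual
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))            # 8 frames of 4-dim toy features
# 3 stages of 16 codewords each; including a zero codeword guarantees
# the residual can never grow from one stage to the next in this toy
codebooks = []
for _ in range(3):
    cb = rng.normal(size=(16, 4))
    cb[0] = 0.0
    codebooks.append(cb)

codes1, xq1 = rvq_encode(x, codebooks[:1])  # one stage
codes3, xq3 = rvq_encode(x, codebooks)      # all three stages
err1 = np.linalg.norm(x - xq1)
err3 = np.linalg.norm(x - xq3)              # err3 <= err1: later stages refine the code
```

Each stage emits one small integer index per frame, so a deeper codebook stack trades a few extra bits for a finer reconstruction — the usual appeal of RVQ over a single large codebook.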