Optimizing Multilingual Text-To-Speech with Accents & Emotions
June 19, 2025
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
cs.AI
Abstract
State-of-the-art text-to-speech (TTS) systems achieve high naturalness in
monolingual settings, but synthesizing speech with correct multilingual accents
(especially for Indic languages) and context-appropriate emotions remains
difficult owing to the cultural nuances missing from current frameworks. This
paper introduces a novel TTS architecture that integrates accent preservation
with multi-scale emotion modelling, tuned in particular for Hindi and Indian
English accents. Our approach extends the Parler-TTS model with a
language-specific phoneme-alignment hybrid encoder-decoder architecture and
culture-sensitive emotion embedding layers trained on native-speaker corpora,
combined with dynamic accent code-switching via residual vector quantization.
Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word
Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy
among native listeners, surpassing the METTS and VECL-TTS baselines. The
system's novelty lies in real-time code-mixing: it generates utterances such as
"Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while
preserving emotional consistency. A subjective evaluation with 200 users
reported a mean opinion score (MOS) of 4.2/5 for cultural correctness,
significantly outperforming existing multilingual systems (p<0.01). This
research makes cross-lingual synthesis more practical by demonstrating scalable
accent-emotion disentanglement, with direct applications in South Asian EdTech
and accessibility software.
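The abstract names residual vector quantization (RVQ) as the mechanism behind dynamic accent code-switching. The paper's actual quantizer is not reproduced here; the following is a minimal, illustrative NumPy sketch of the generic RVQ idea only — successive codebooks quantize the residual left by earlier stages — with hypothetical names (`rvq_encode`, random toy codebooks) not taken from the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook stage quantizes
    the residual left over by the previous stages."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # pick the nearest codeword for each frame's current residual
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))            # 8 frames of 4-dim toy features
# 3 stages of 16 codewords each; including a zero codeword guarantees
# the residual can never grow from one stage to the next in this toy
codebooks = []
for _ in range(3):
    cb = rng.normal(size=(16, 4))
    cb[0] = 0.0
    codebooks.append(cb)

codes1, xq1 = rvq_encode(x, codebooks[:1])  # one stage
codes3, xq3 = rvq_encode(x, codebooks)      # all three stages
err1 = np.linalg.norm(x - xq1)
err3 = np.linalg.norm(x - xq3)              # err3 <= err1: later stages refine the code
```

Each stage emits one small integer index per frame, so a deeper codebook stack trades a few extra bits for a finer reconstruction — the usual appeal of RVQ over a single large codebook.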