Optimizing Multilingual Text-To-Speech with Accents & Emotions
June 19, 2025
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
cs.AI
Abstract
State-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, yet synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent control with transliteration preservation and multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model with a language-specific phoneme-alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native-speaker corpora, and dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy among native listeners, surpassing the METTS and VECL-TTS baselines. The system's novelty lies in real-time code mixing: it generates utterances such as "Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by demonstrating scalable accent-emotion disentanglement, with direct applications in South Asian EdTech and accessibility software.
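
The abstract names residual vector quantization (RVQ) as the mechanism paired with dynamic accent code switching. For readers unfamiliar with RVQ, the sketch below shows its core loop on toy data: each codebook stage quantizes the residual left by the previous stage, so a vector is represented by one code index per stage. The function name `rvq_encode`, the codebook sizes, and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
# A minimal, illustrative sketch of residual vector quantization (RVQ).
# All names and dimensions here are hypothetical; the paper's actual
# codebooks and training procedure are not specified in the abstract.
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks.

    Each stage quantizes the residual left by the previous stage, so the
    reconstruction is the sum of one code vector per codebook.
    """
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:                      # cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))             # nearest code for this stage
        indices.append(k)
        quantized += cb[k]
        residual -= cb[k]                     # pass the residual onward
    return indices, quantized

rng = np.random.default_rng(0)
dim, num_codes, num_stages = 8, 16, 4
codebooks = [rng.normal(size=(num_codes, dim)) for _ in range(num_stages)]
x = rng.normal(size=dim)

idx, x_hat = rvq_encode(x, codebooks)
print("codes per stage:", idx)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

Because each added stage only refines the previous stages' residual, reconstruction error shrinks as stages accumulate, which is what makes RVQ attractive for compactly encoding accent or speaker attributes alongside content.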