Optimizing Multilingual Text-To-Speech with Accents & Emotions
June 19, 2025
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
cs.AI
Abstract
State-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, yet synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent control with transliteration preservation and multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model with a language-specific phoneme-alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native-speaker corpora, and dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy among native listeners, surpassing the METTS and VECL-TTS baselines. The system's novelty lies in real-time code mixing: it generates utterances such as "Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by demonstrating scalable accent-emotion disentanglement, with direct applications in South Asian EdTech and accessibility software.
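
The abstract names residual vector quantization (RVQ) as the mechanism paired with dynamic accent code switching. For readers unfamiliar with RVQ, the sketch below shows its core loop on toy data: each codebook stage quantizes the residual left by the previous stage, so a vector is represented by one code index per stage. The function name `rvq_encode`, the codebook sizes, and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
# A minimal, illustrative sketch of residual vector quantization (RVQ).
# All names and dimensions here are hypothetical; the paper's actual
# codebooks and training procedure are not specified in the abstract.
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks.

    Each stage quantizes the residual left by the previous stage, so the
    reconstruction is the sum of one code vector per codebook.
    """
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:                      # cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))             # nearest code for this stage
        indices.append(k)
        quantized += cb[k]
        residual -= cb[k]                     # pass the residual onward
    return indices, quantized

rng = np.random.default_rng(0)
dim, num_codes, num_stages = 8, 16, 4
codebooks = [rng.normal(size=(num_codes, dim)) for _ in range(num_stages)]
x = rng.normal(size=dim)

idx, x_hat = rvq_encode(x, codebooks)
print("codes per stage:", idx)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

Because each added stage only refines the previous stages' residual, reconstruction error shrinks as stages accumulate, which is what makes RVQ attractive for compactly encoding accent or speaker attributes alongside content.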