Optimizing Multilingual Text-To-Speech with Accents & Emotions

June 19, 2025
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
cs.AI

Abstract

State-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings; however, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural-nuance gaps in current frameworks. This paper introduces a new TTS architecture that integrates accents while preserving transliteration, together with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme-alignment hybrid encoder-decoder architecture and culture-sensitive emotion-embedding layers trained on native-speaker corpora, and by incorporating dynamic accent code-switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion-recognition accuracy among native listeners, surpassing the METTS and VECL-TTS baselines. The system's novelty lies in real-time code-mixing: it generates utterances such as "Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while preserving emotional consistency. A subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by demonstrating scalable accent-emotion disentanglement, with direct applications in South Asian EdTech and accessibility software.
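The abstract mentions residual vector quantization (RVQ) as part of the dynamic accent code-switching pipeline. The paper's actual implementation is not shown here; as background, the sketch below illustrates the generic RVQ idea with NumPy: a stack of codebooks where each stage quantizes the residual left over by the previous stage. All names (`residual_vector_quantize`, the codebook shapes) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Generic RVQ sketch: each codebook stage quantizes the residual
    left by the previous stage; the sum of selected codewords
    approximates x with increasing fidelity per stage."""
    residual = x.astype(float)
    indices = []
    quantized = np.zeros_like(residual)
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]
    return indices, quantized

# Toy example: three stages of 16 codewords over 8-dim embeddings
# (sizes are arbitrary, chosen only for illustration).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
x = rng.normal(size=8)
idx, xq = residual_vector_quantize(x, codebooks)
```

Encoding an embedding as a short list of codebook indices (here, three integers) is what makes schemes like this attractive for real-time switching: the compact discrete code can be swapped or interpolated per segment without re-running a heavy encoder.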