Optimizing Multilingual Text-To-Speech with Accents & Emotions

Abstract

State-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, but synthesizing speech with correct multilingual accents (especially for Indic languages) and context-appropriate emotions remains difficult because current frameworks handle cultural nuance poorly. This paper introduces a new TTS architecture that integrates accent modeling and transliteration preservation with multi-scale emotion modeling, tuned specifically for Hindi and Indian English accents. Our approach extends the Parler-TTS model with three components: a hybrid encoder-decoder architecture with language-specific phoneme alignment, culture-sensitive emotion embedding layers trained on native-speaker corpora, and dynamic accent code-switching based on residual vector quantization. Quantitative tests show a 23.7% improvement in accent accuracy (word error rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy among native listeners, surpassing the METTS and VECL-TTS baselines. The system's novelty is real-time code-mixing: it can generate utterances such as "Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while preserving emotional consistency. In a subjective evaluation with 200 users, the system received a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly higher than existing multilingual systems (p<0.01). By demonstrating scalable accent-emotion disentanglement, this work makes cross-lingual synthesis more feasible, with direct applications in South Asian EdTech and accessibility software.
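
For readers unfamiliar with residual vector quantization, the following is a minimal, generic sketch of the technique in plain Python. It is not the paper's implementation: the codebooks, the two-stage depth, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical and chosen only for illustration. Each stage picks the codeword closest to what the previous stages left unexplained, so a vector is represented by one small index per stage.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int],
               codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct the vector by summing the selected codeword of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer one.
# In a real system these would be learned from data, not hand-written.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25), (-0.25, 0.25)],
]

codes = rvq_encode((1.2, 0.9), CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)
```

In this toy setup the second stage refines the coarse first-stage choice, which is the property that lets a stack of small codebooks approximate a vector more closely than any single one of them.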
