
Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

March 8, 2026
Authors: Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi, Iqra Altaf Gillani, Aadil Amin Kak, Janibul Bashir
cs.AI

Abstract

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.
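
To make the OT-CFM objective named in the abstract more concrete, the following is a minimal sketch of how a training pair is typically built in optimal-transport conditional flow matching: a noise sample x0 is linearly interpolated toward a target mel frame x1, and the network is regressed onto the constant velocity between them. The function names (`ot_cfm_pair`, `cfm_loss`), the value of `SIGMA_MIN`, and the use of plain Python lists are assumptions for illustration; this is not the authors' code, and the abstract does not state these details.

```python
import random

# Assumed small noise floor; the abstract does not give the value used.
SIGMA_MIN = 1e-4


def ot_cfm_pair(x1, t, x0=None, sigma_min=SIGMA_MIN, rng=random):
    """Build one OT-CFM training pair for a target frame x1 at time t in [0, 1].

    Returns (x_t, u_t), where
      x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   (point on the path)
      u_t = x1 - (1 - sigma_min) * x0                 (regression target)
    """
    if x0 is None:
        # Sample the source point from a standard Gaussian.
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    u_t = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between a predicted velocity and the target."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)


if __name__ == "__main__":
    mel_frame = [1.0, -2.0, 0.5]          # hypothetical target values
    x_t, u_t = ot_cfm_pair(mel_frame, t=0.3)
    # A real model would predict the velocity from (x_t, t, text encoding);
    # a zero vector stands in for that prediction here.
    print(cfm_loss([0.0] * len(u_t), u_t))
```

In an actual system the velocity would be predicted by a neural network conditioned on the text encoding, and synthesis would integrate that velocity field from noise at t = 0 to a mel-spectrogram at t = 1 with an ODE solver.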