Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
March 8, 2026
Authors: Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi, Iqra Altaf Gillani, Aadil Amin Kak, Janibul Bashir
cs.AI
Abstract
Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.
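The abstract names OT-CFM as the training objective but gives no equations. As a rough illustration only, the sketch below follows the standard OT-CFM formulation used by Matcha-TTS: a noise sample x0 is moved to a target mel frame x1 along a near-straight path, and the network regresses onto the constant velocity of that path. The function names, the `SIGMA_MIN` value, and the toy vectors are assumptions made for this example and are not taken from the Bolbosh code.

```python
import random

# Assumed small constant; the abstract does not state the value used.
SIGMA_MIN = 1e-4


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for the OT conditional flow matching path.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   (point on the path)
    u_t = x1 - (1 - sigma_min) * x0                 (target velocity)
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def ot_cfm_loss(predict_velocity, x1, rng, sigma_min=SIGMA_MIN):
    """Single-sample Monte Carlo estimate of the OT-CFM loss.

    predict_velocity(x_t, t) is a hypothetical stand-in for the
    text-conditioned decoder network; x1 is one target mel frame.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    x_t, u_t = ot_cfm_pair(x0, x1, t, sigma_min)
    v = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_frame = [0.5, 0.25, -1.0]  # toy "mel frame", not real data
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    print(ot_cfm_loss(zero_model, toy_frame, rng))
```

Because the target velocity does not depend on t, the learned flow can usually be integrated in a small number of ODE steps at inference time, which is one reason this objective is attractive when paired data is limited.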