Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Abstract

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.
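To make the OT-CFM objective named above concrete, here is a minimal sketch of how one training pair is commonly constructed in optimal-transport conditional flow matching. It is an illustration under stated assumptions, not the Bolbosh implementation: the function names (ot_cfm_pair, cfm_loss), the value of sigma_min, and the use of plain Python lists in place of mel-spectrogram tensors are all hypothetical. A noise sample x0 and a data sample x1 are joined by a straight path, and the network is trained to regress the constant velocity along that path.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; the abstract does not state it


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for noise x0, data x1 and time t in [0, 1].

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   (point on the straight path)
    u_t = x1 - (1 - sigma_min) * x0                 (its time derivative, the target)
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predict_velocity, x1, rng, sigma_min=SIGMA_MIN):
    """Mean squared error between a predicted velocity and the OT-CFM target.

    predict_velocity(x_t, t) is a hypothetical stand-in for the decoder network.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # sample from the Gaussian prior
    t = rng.random()                         # uniform time in [0, 1)
    x_t, u_t = ot_cfm_pair(x0, x1, t, sigma_min)
    v = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_frame = [0.5, -1.0, 2.0]  # stand-in for one mel-spectrogram frame
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    print(cfm_loss(zero_model, toy_frame, rng))
```

In a real system the lists would be batched mel-spectrogram tensors and predict_velocity would also be conditioned on the text encoder output. Because the target velocity does not depend on t, the learned flow follows near-straight paths, which is the usual argument for why this objective allows synthesis in few solver steps.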
