Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Abstract

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust text-to-speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely because they model Perso-Arabic diacritics and language-specific phonotactics inadequately. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework, which enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming the multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.
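As a rough illustration of the OT-CFM objective named above, the sketch below builds the straight-line probability path and its regression target for a single sample, following the general formulation in the flow matching literature rather than the Bolbosh code itself. The function names, the plain-list representation of a mel frame, and the value of sigma_min are assumptions made for illustration; the abstract does not specify them, and the conditioning on the text encoder output used in a real TTS model is omitted.

```python
import random

# Assumed value for illustration only; the abstract does not state it.
SIGMA_MIN = 1e-4


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) on the OT-CFM straight-line path.

    x0: noise sample, x1: data sample (e.g. one mel frame), t in [0, 1].
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0   (the time derivative of x_t)
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predict, x1, rng):
    """Single-sample mean squared error between predict(x_t, t) and u_t.

    `predict` is a hypothetical stand-in for the learned vector field.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # standard normal noise
    t = rng.random()                        # uniform time in [0, 1)
    x_t, u_t = ot_cfm_pair(x0, x1, t)
    v = predict(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)
```

In training, a loss of this form would be averaged over many frames and random draws of t and x0. Because the target u_t does not depend on t, the paths from noise to data are straight lines, which is one commonly cited reason such models can sample in few solver steps.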
