LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of the script in which the audio was uttered. Off-the-shelf encoders do not, and the failure is accent-conditional. On a corpus of 1043 Western-accented voice pairs across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a corpus of 1369 Indian-accented voice pairs, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a voice that was not trained on Indic languages into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over a frozen WavLM-base-plus, trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy loss against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE leaves a residual gap that is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero) and widens the cross-script-vs-floor margin by 2.4-2.7x relative to both baselines. An ECAPA+GRL ablation shows that the GRL objective improves either backbone, and that the choice of WavLM contributes as well. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs. 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
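As an illustration of what "bootstrap 95% CIs include zero" means for a cross-script gap, here is a minimal Python sketch. It is not the released bootstrap recipe. It assumes that the gap Δ is the mean cosine similarity of same-voice, same-script pairs minus the mean of same-voice, cross-script pairs, and that the two sets of similarities are resampled independently with a percentile bootstrap. The function names (`cosine`, `bootstrap_gap_ci`, `gap_consistent_with_zero`) and all default parameters are hypothetical.

```python
import math
import random
from statistics import fmean


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bootstrap_gap_ci(same_script, cross_script, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(same_script) - mean(cross_script).

    Both arguments are lists of cosine similarities between embeddings of the
    same voice. Assumption (not from the paper): the two lists are resampled
    independently, with replacement, at their original sizes.
    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)
    point = fmean(same_script) - fmean(cross_script)
    gaps = []
    for _ in range(n_resamples):
        s = rng.choices(same_script, k=len(same_script))
        c = rng.choices(cross_script, k=len(cross_script))
        gaps.append(fmean(s) - fmean(c))
    gaps.sort()
    # Percentile interval: cut alpha/2 of the resampled gaps from each tail.
    low_idx = int((alpha / 2) * n_resamples)
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, gaps[low_idx], gaps[high_idx]


def gap_consistent_with_zero(ci_low, ci_high):
    """True when the confidence interval contains zero."""
    return ci_low <= 0.0 <= ci_high
```

In use, `same_script` and `cross_script` would be filled by calling `cosine` on pairs of encoder embeddings, and `gap_consistent_with_zero` applied to the returned interval gives the yes/no reading quoted in the abstract. Resampling whole voices rather than individual pairs would be a reasonable alternative when pairs from the same voice are correlated.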
