LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
May 1, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and widens the cross-script-vs-floor margin by 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves both backbones, but the choice of WavLM also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
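The adversarial mechanism at the heart of LASE is the gradient-reversal layer (GRL): identity on the forward pass, sign-flipped (and scaled) gradient on the backward pass, so the projection head is updated to make the language classifier fail while the classifier itself trains normally. The following is a minimal numerical sketch of that dynamic with hand-written gradients on a toy linear model; all variable names and shapes here are illustrative assumptions, not the paper's implementation (which uses a frozen WavLM backbone, a supervised contrastive speaker loss, and a 4-way language head).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for LASE's structure: a trainable projection W maps a frozen
# backbone feature x to an embedding z; a language head a predicts a scalar
# language target y from z. The GRL sits between W and the head: identity
# forward, gradient multiplied by -lam backward.
x = rng.normal(size=4)        # frozen backbone feature (stand-in for WavLM)
W = rng.normal(size=(3, 4))   # trainable projection head
a = rng.normal(size=3)        # language-head weights (toy 1-D "classifier")
y = 1.0                       # language target
lam, lr = 1.0, 0.1            # GRL strength and learning rate

def lang_loss(W, a):
    z = W @ x                 # embedding (GRL forward pass = identity)
    return 0.5 * (a @ z - y) ** 2

# Backward pass by hand.
z = W @ x
err = a @ z - y               # dL/d(a.z)
grad_a = err * z              # language head: ordinary descent gradient
grad_z = err * a              # gradient arriving at the GRL from the head
grad_W = np.outer(-lam * grad_z, x)   # GRL flips the sign before reaching W

before = lang_loss(W, a)
W_adv = W - lr * grad_W       # "descent" on the reversed gradient = ascent
after = lang_loss(W_adv, a)

print(f"language loss before: {before:.4f}, after GRL step: {after:.4f}")
# Because the GRL reversed the gradient, this update *increases* the language
# loss: the embedding has become less language-informative, which is exactly
# the pressure LASE applies while the contrastive loss preserves speaker identity.
```

In a real implementation the same effect is typically obtained with a custom autograd function (e.g. `torch.autograd.Function` with `backward` returning the negated, scaled gradient), so a single optimiser step trains the classifier normally while adversarially updating everything upstream of the GRL.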