LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
May 1, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and widens the cross-script-vs-floor margin by 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves both backbones, but the choice of WavLM also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
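The adversarial mechanism at the heart of LASE is the gradient-reversal layer (GRL): identity on the forward pass, sign-flipped (and scaled) gradient on the backward pass, so the projection head is updated to make the language classifier fail while the classifier itself trains normally. The following is a minimal numerical sketch of that dynamic with hand-written gradients on a toy linear model; all variable names and shapes here are illustrative assumptions, not the paper's implementation (which uses a frozen WavLM backbone, a supervised contrastive speaker loss, and a 4-way language head).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for LASE's structure: a trainable projection W maps a frozen
# backbone feature x to an embedding z; a language head a predicts a scalar
# language target y from z. The GRL sits between W and the head: identity
# forward, gradient multiplied by -lam backward.
x = rng.normal(size=4)        # frozen backbone feature (stand-in for WavLM)
W = rng.normal(size=(3, 4))   # trainable projection head
a = rng.normal(size=3)        # language-head weights (toy 1-D "classifier")
y = 1.0                       # language target
lam, lr = 1.0, 0.1            # GRL strength and learning rate

def lang_loss(W, a):
    z = W @ x                 # embedding (GRL forward pass = identity)
    return 0.5 * (a @ z - y) ** 2

# Backward pass by hand.
z = W @ x
err = a @ z - y               # dL/d(a.z)
grad_a = err * z              # language head: ordinary descent gradient
grad_z = err * a              # gradient arriving at the GRL from the head
grad_W = np.outer(-lam * grad_z, x)   # GRL flips the sign before reaching W

before = lang_loss(W, a)
W_adv = W - lr * grad_W       # "descent" on the reversed gradient = ascent
after = lang_loss(W_adv, a)

print(f"language loss before: {before:.4f}, after GRL step: {after:.4f}")
# Because the GRL reversed the gradient, this update *increases* the language
# loss: the embedding has become less language-informative, which is exactly
# the pressure LASE applies while the contrastive loss preserves speaker identity.
```

In a real implementation the same effect is typically obtained with a custom autograd function (e.g. `torch.autograd.Function` with `backward` returning the negated, scaled gradient), so a single optimiser step trains the classifier normally while adversarially updating everything upstream of the GRL.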