

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

May 1, 2026
Author: Venkata Pushpak Teja Menta
cs.AI

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy loss against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero), and it amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows that the GRL objective improves both backbones, but the choice of WavLM backbone also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
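The headline numbers above are cross-script cosine-similarity gaps (Δ) over paired embeddings of the same voice. A minimal, dependency-free sketch of how such a gap can be computed is shown below; the function names and the pure-Python implementation are illustrative assumptions, not the authors' released code.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_script_gap(same_script_pairs, cross_script_pairs):
    """Delta = mean same-script cosine minus mean cross-script cosine,
    computed over embedding pairs of the same voice.

    A script-invariant encoder gives a Delta near zero; a positive
    Delta means the embedding leaks script/language information.
    """
    same = [cosine(u, v) for u, v in same_script_pairs]
    cross = [cosine(u, v) for u, v in cross_script_pairs]
    return sum(same) / len(same) - sum(cross) / len(cross)
```

In this framing, the paper's "consistent with zero" claim corresponds to the bootstrap confidence interval of this Δ statistic including 0 on both corpora.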
PDF · May 5, 2026