

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

May 1, 2026
Author: Venkata Pushpak Teja Menta
cs.AI

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy loss against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero), and it amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows that the GRL objective improves both backbones, but the choice of WavLM backbone also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
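The headline numbers above are cross-script cosine-similarity gaps (Δ) over paired embeddings of the same voice. A minimal, dependency-free sketch of how such a gap can be computed is shown below; the function names and the pure-Python implementation are illustrative assumptions, not the authors' released code.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_script_gap(same_script_pairs, cross_script_pairs):
    """Delta = mean same-script cosine minus mean cross-script cosine,
    computed over embedding pairs of the same voice.

    A script-invariant encoder gives a Delta near zero; a positive
    Delta means the embedding leaks script/language information.
    """
    same = [cosine(u, v) for u, v in same_script_pairs]
    cross = [cosine(u, v) for u, v in cross_script_pairs]
    return sum(same) / len(same) - sum(cross) / len(cross)
```

In this framing, the paper's "consistent with zero" claim corresponds to the bootstrap confidence interval of this Δ statistic including 0 on both corpora.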
PDF · May 5, 2026