LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
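The abstract reports each encoder's cross-script gap as a drop in cosine similarity and calls LASE's residual gap "consistent with zero" because its bootstrap 95% CI includes zero. The sketch below illustrates one way such a check could be computed. It rests on assumptions not stated in the abstract: the gap is taken as mean same-script similarity minus mean cross-script similarity for the same voice, the interval is a percentile bootstrap, and the two conditions are resampled independently. The function names (`script_gap`, `bootstrap_gap_ci`, `includes_zero`) and the toy similarity values are hypothetical and are not taken from the released bootstrap recipe.

```python
import math
import random
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def script_gap(same_script_sims, cross_script_sims):
    """Assumed gap definition: mean same-script similarity minus
    mean cross-script similarity for the same voice."""
    return mean(same_script_sims) - mean(cross_script_sims)


def bootstrap_gap_ci(same_script_sims, cross_script_sims,
                     n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gap.

    Assumption: each condition is resampled with replacement,
    independently of the other.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        same = rng.choices(same_script_sims, k=len(same_script_sims))
        cross = rng.choices(cross_script_sims, k=len(cross_script_sims))
        gaps.append(script_gap(same, cross))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def includes_zero(ci):
    """True when the interval straddles zero, i.e. no detectable gap."""
    lower, upper = ci
    return lower <= 0.0 <= upper


if __name__ == "__main__":
    # Toy similarity scores, made up purely for illustration.
    same = [0.81, 0.78, 0.84, 0.80, 0.79, 0.83]
    cross = [0.72, 0.70, 0.75, 0.69, 0.74, 0.71]
    ci = bootstrap_gap_ci(same, cross)
    print("gap:", script_gap(same, cross))
    print("95% CI:", ci, "includes zero:", includes_zero(ci))
```

Resampling the two conditions independently is the simplest choice. If the same-script and cross-script scores were matched per voice, a paired bootstrap over voices would be the more natural variant.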
