Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation
August 12, 2024
Authors: Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin
cs.AI
Abstract
A general disentanglement-based speaker anonymization system typically
separates speech into content, speaker, and prosody features using individual
encoders. This paper explores how to adapt such a system when a new speech
attribute, for example, emotion, needs to be preserved to a greater extent.
While existing systems are good at anonymizing speaker embeddings, they are not
designed to preserve emotion. Two strategies for this are examined. First, we
show that integrating emotion embeddings from a pre-trained emotion encoder can
help preserve emotional cues, even though this approach slightly compromises
privacy protection. Alternatively, we propose an emotion compensation strategy
as a post-processing step applied to anonymized speaker embeddings. This
conceals the original speaker's identity and reintroduces the emotional traits
lost during speaker embedding anonymization. Specifically, we model the emotion
attribute using support vector machines to learn separate boundaries for each
emotion. During inference, the original speaker embedding is processed in two
ways: one, by an emotion indicator to predict emotion and select the
emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal
speaker characteristics. The anonymized speaker embedding is then modified
along the corresponding SVM boundary towards an enhanced emotional direction to
preserve the emotional cues. The proposed strategies are also expected to be useful
for adapting a general disentanglement-based speaker anonymization system to
preserve other target paralinguistic attributes, with potential for a range of
downstream tasks.
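The emotion compensation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D "speaker embeddings", the subgradient-descent SVM trainer, the fixed pseudo-speaker vector standing in for the anonymizer, and the step size `alpha` are all assumptions made for illustration. The emotion indicator is approximated here by taking the highest SVM score, and "modified along the SVM boundary towards an enhanced emotional direction" is read as a shift along the unit normal of the selected boundary.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit (w, b) of a linear SVM by subgradient descent on the hinge loss.

    y holds labels in {+1, -1}. Toy trainer, assumed for illustration only.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            # L2 regularisation shrinks w on every step.
            w = [wj * (1 - lr * reg) for wj in w]
            if margin < 1:
                # Sample violates the margin: move the boundary towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_per_emotion_svms(X, labels):
    """One-vs-rest: learn a separate boundary for each emotion."""
    return {
        e: train_linear_svm(X, [1 if lab == e else -1 for lab in labels])
        for e in sorted(set(labels))
    }


def score(svm, x):
    w, b = svm
    return dot(w, x) + b


def predict_emotion(svms, x):
    """Stand-in emotion indicator: pick the emotion with the highest SVM score."""
    return max(svms, key=lambda e: score(svms[e], x))


def anonymize(x):
    """Hypothetical anonymizer: returns a fixed pseudo-speaker embedding."""
    return [0.0, -1.0]


def compensate(x_anon, svm, alpha=1.0):
    """Shift the anonymized embedding by alpha along the unit normal of the boundary."""
    w, _ = svm
    norm = math.sqrt(dot(w, w))
    return [xj + alpha * wj / norm for xj, wj in zip(x_anon, w)]


# Toy training data: 2-D embeddings with emotion labels (invented values).
X = [[2.0, 0.5], [1.5, 1.0], [-2.0, 0.3], [-1.5, -0.5]]
labels = ["happy", "happy", "angry", "angry"]
svms = train_per_emotion_svms(X, labels)

x_orig = [1.8, 0.7]                       # original speaker embedding
emotion = predict_emotion(svms, x_orig)   # branch 1: select the matching SVM
x_anon = anonymize(x_orig)                # branch 2: conceal the speaker
x_comp = compensate(x_anon, svms[emotion], alpha=1.0)
```

Because the shift is along the unit normal, the selected SVM's score for the compensated embedding rises by `alpha` times the norm of `w`, so `alpha` directly trades emotional strength against distance from the anonymized embedding.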