Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation
August 12, 2024
Authors: Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin
cs.AI
Abstract
A general disentanglement-based speaker anonymization system typically
separates speech into content, speaker, and prosody features using individual
encoders. This paper explores how to adapt such a system when a new speech
attribute, for example, emotion, needs to be preserved to a greater extent.
While existing systems are good at anonymizing speaker embeddings, they are not
designed to preserve emotion. Two strategies for this are examined. First, we
show that integrating emotion embeddings from a pre-trained emotion encoder can
help preserve emotional cues, even though this approach slightly compromises
privacy protection. Alternatively, we propose an emotion compensation strategy
as a post-processing step applied to anonymized speaker embeddings. This
conceals the original speaker's identity and reintroduces the emotional traits
lost during speaker embedding anonymization. Specifically, we model the emotion
attribute using support vector machines (SVMs) to learn separate boundaries for each
emotion. During inference, the original speaker embedding is processed in two
ways: one, by an emotion indicator to predict emotion and select the
emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal
speaker characteristics. The anonymized speaker embedding is then modified
along the corresponding SVM boundary towards an enhanced emotional direction to
preserve the emotional cues. The proposed strategies are also expected to be useful
for adapting a general disentanglement-based speaker anonymization system to
preserve other target paralinguistic attributes, with potential for a range of
downstream tasks.
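
The inference steps of the emotion compensation strategy can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so several things here are assumptions made only for illustration: each emotion is modeled by a linear one-vs-rest SVM stored as a pair `(w, b)`, the emotion indicator is approximated by picking the SVM with the highest decision score, the anonymized embedding is taken as given, and the step size `alpha` is a hypothetical parameter. All function names and toy values are invented for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_score(svm, x):
    """Signed score w.x + b of a linear SVM; larger means 'more of this emotion'."""
    w, b = svm
    return dot(w, x) + b


def predict_emotion(svms, x):
    """Assumed stand-in for the emotion indicator: choose the highest-scoring SVM."""
    return max(svms, key=lambda emotion: decision_score(svms[emotion], x))


def compensate(anonymized, svm, alpha):
    """Shift the anonymized embedding by alpha along the unit normal of the SVM boundary."""
    w, _ = svm
    norm = math.sqrt(dot(w, w))
    return [a + alpha * wi / norm for a, wi in zip(anonymized, w)]


# Toy, hand-picked hyperplanes (assumed to be already trained), one per emotion.
svms = {
    "happy": ([1.0, 0.0, 0.0], 0.0),
    "sad": ([0.0, 2.0, 0.0], -0.5),
}

original = [0.9, 0.1, 0.3]     # original speaker embedding (toy values)
anonymized = [-0.2, 0.4, 0.5]  # output of some speaker anonymizer (assumed given)

# Step 1: predict the emotion from the ORIGINAL embedding and select its SVM.
emotion = predict_emotion(svms, original)

# Step 2: move the ANONYMIZED embedding towards the positive side of that SVM.
compensated = compensate(anonymized, svms[emotion], alpha=1.0)

score_before = decision_score(svms[emotion], anonymized)
score_after = decision_score(svms[emotion], compensated)
print(emotion, score_before, score_after)
```

With a linear boundary, a step of size `alpha` along the unit normal raises the decision score by `alpha * ||w||`, so the anonymized embedding moves towards the side of the selected emotion while the other dimensions are left untouched. How far to move without leaking speaker identity is a trade-off that this sketch does not address.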