StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
January 19, 2024
Authors: Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, Yuxuan Wang
cs.AI
Abstract
Recent language model (LM) advancements have showcased impressive zero-shot
voice conversion (VC) performance. However, existing LM-based VC models usually
convert source semantics to acoustic features offline, which requires the
complete source speech and limits their deployment in real-time
applications. In this paper, we introduce StreamVoice, a novel streaming
LM-based model for zero-shot VC, facilitating real-time conversion given
arbitrary speaker prompts and source speech. Specifically, to enable streaming
capability, StreamVoice employs a fully causal context-aware LM with a
temporal-independent acoustic predictor and alternately processes semantic
and acoustic features at each autoregressive time step, which eliminates the
dependence on complete source speech. To address the potential performance
degradation from the incomplete context in streaming processing, we enhance the
context-awareness of the LM through two strategies: 1) teacher-guided context
foresight, which uses a teacher model to summarize the present and future semantic
context during training and so guides the model's forecasting of missing context;
2) a semantic masking strategy, which promotes acoustic prediction from preceding
corrupted semantic and acoustic input and enhances context-learning ability.
Notably, StreamVoice is the first LM-based streaming zero-shot VC model without
any future look-ahead. Experimental results demonstrate StreamVoice's streaming
conversion capability while maintaining zero-shot performance comparable to
non-streaming VC systems.
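
As a rough illustration of the alternating processing and semantic masking described in the abstract, the following Python sketch works on plain integer tokens. It is a toy under stated assumptions, not the StreamVoice implementation: the names `interleave`, `mask_semantic`, `stream_convert`, and `MASK` are hypothetical, the predictor is an arbitrary callable standing in for the causal LM and its acoustic predictor, and uniform random per-token masking is assumed because the abstract does not specify the masking scheme.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence

MASK = -1  # hypothetical id standing in for a corrupted semantic token


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[int]:
    """Lay tokens out as [s_1, a_1, s_2, a_2, ...], one pair per time step."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must have equal length")
    out: List[int] = []
    for s, a in zip(semantic, acoustic):
        out.extend((s, a))
    return out


def mask_semantic(semantic: Sequence[int], ratio: float,
                  rng: random.Random) -> List[int]:
    """Replace a random subset of semantic tokens with MASK.

    Assumption: uniform per-token masking; the actual scheme may differ.
    """
    n_masked = round(len(semantic) * ratio)
    masked_idx = set(rng.sample(range(len(semantic)), n_masked))
    return [MASK if i in masked_idx else s for i, s in enumerate(semantic)]


def stream_convert(semantic_stream: Iterable[int],
                   predict_acoustic: Callable[[List[int]], int],
                   prompt: Sequence[int] = ()) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token.

    The predictor only ever sees the prompt and past tokens, so no
    future look-ahead is needed.
    """
    history: List[int] = list(prompt)  # e.g. interleaved speaker-prompt tokens
    for s in semantic_stream:
        history.append(s)                # semantic token for the current step
        a = predict_acoustic(history)    # causal: conditioned on the past only
        history.append(a)                # fed back for the next step
        yield a
```

In this sketch, `stream_convert` can start producing output as soon as the first semantic token arrives, which is the property that removes the dependence on complete source speech. `mask_semantic` would be applied to training inputs only, so that the model learns to predict acoustics from partially corrupted context.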