StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Abstract

Recent advances in language models (LMs) have shown impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually convert source semantics to acoustic features offline, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, which uses a teacher model to summarize the present and future semantic context during training and so guides the model's forecasting of the missing context; 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model that requires no future look-ahead. Experimental results demonstrate StreamVoice's streaming conversion capability while maintaining zero-shot performance comparable to non-streaming VC systems.
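As a rough illustration of the alternating input layout and the semantic masking idea described in the abstract, the following minimal Python sketch may help. It is not the authors' implementation: the function names, the string tokens, the `<mask>` symbol, and the masking ratio are all hypothetical, and real systems would operate on learned feature vectors or codec tokens rather than strings.

```python
import random


def interleave(semantic, acoustic):
    """Build one sequence [s1, a1, s2, a2, ...] from frame-aligned features.

    Assumption: one semantic and one acoustic item per time step.
    """
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must be frame-aligned")
    seq = []
    for s, a in zip(semantic, acoustic):
        seq.append(("sem", s))
        seq.append(("ac", a))
    return seq


def context_for_step(semantic, acoustic, t):
    """Return what a fully causal model may see when predicting acoustic frame t.

    Visible context (0-indexed): semantic[0..t] and acoustic[0..t-1].
    No future semantic frame is included, i.e. no look-ahead.
    """
    return interleave(semantic[:t], acoustic[:t]) + [("sem", semantic[t])]


def mask_semantic(semantic, ratio, mask_token="<mask>", rng=None):
    """Randomly corrupt semantic inputs for training (hypothetical scheme).

    Each item is replaced by mask_token with probability `ratio`.
    """
    rng = rng or random.Random(0)
    return [mask_token if rng.random() < ratio else s for s in semantic]


if __name__ == "__main__":
    sem = ["s1", "s2", "s3"]
    ac = ["a1", "a2", "a3"]
    print(interleave(sem, ac))
    print(context_for_step(sem, ac, 1))
    print(mask_semantic(sem, ratio=0.5))
```

In this layout, the context used for frame t never contains semantic information beyond frame t, which is the property that lets conversion start before the full source utterance is available.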
