VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
June 8, 2024
Authors: Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
cs.AI
Abstract
This paper introduces VALL-E 2, the latest advancement in neural codec
language models that marks a milestone in zero-shot text-to-speech synthesis
(TTS), achieving human parity for the first time. Based on its predecessor,
VALL-E, the new iteration introduces two significant enhancements: Repetition
Aware Sampling refines the original nucleus sampling process by accounting for
token repetition in the decoding history. It not only stabilizes the decoding
but also circumvents the infinite loop issue. Grouped Code Modeling organizes
codec codes into groups to effectively shorten the sequence length, which not
only boosts inference speed but also addresses the challenges of long sequence
modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E
2 surpasses previous systems in speech robustness, naturalness, and speaker
similarity. It is the first of its kind to reach human parity on these
benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech,
even for sentences that are traditionally challenging due to their complexity
or repetitive phrases. The advantages of this work could contribute to valuable
endeavors, such as generating speech for individuals with aphasia or people
with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to
https://aka.ms/valle2.
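The abstract describes Repetition Aware Sampling as a refinement of nucleus sampling that consults the decoding history. Below is a minimal illustrative sketch of that idea, not the paper's exact algorithm: the function names, the history window size, and the repetition-ratio threshold are all assumptions made for illustration.

```python
import numpy as np

def nucleus_sample(probs, top_p, rng):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize and sample within that set.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    kept = order[:cutoff]
    p = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=p))

def repetition_aware_sample(probs, history, top_p=0.9,
                            window=10, ratio=0.5, rng=None):
    """Illustrative sketch: if the nucleus-sampled token is already
    over-represented in the recent decoding history, fall back to
    sampling from the full distribution to break the repetition.
    `window` and `ratio` are hypothetical hyperparameters."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) >= ratio:
        # Token repeats too often: resample from the unfiltered
        # distribution instead of the truncated nucleus.
        token = int(rng.choice(len(probs), p=probs))
    return token
```

The fallback to the full distribution is what lets the decoder escape the degenerate loops that pure nucleus sampling can fall into on repetitive text.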
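The abstract also credits Grouped Code Modeling with shortening the codec token sequence by organizing codes into groups, so the autoregressive model takes fewer steps per utterance. A minimal sketch of that sequence-shortening idea follows; the helper names and the group size are hypothetical, and the real model predicts all codes in a group jointly rather than merely reshaping them.

```python
import numpy as np

def group_codes(codes, group_size):
    """Partition a 1-D codec token sequence into fixed-size groups,
    right-padding with zeros so the length divides evenly. Each row
    then corresponds to one modeling step instead of `group_size`."""
    pad = (-len(codes)) % group_size
    padded = np.pad(codes, (0, pad), constant_values=0)
    return padded.reshape(-1, group_size)

def ungroup_codes(groups, original_len):
    # Flatten the groups and drop the padding to recover the
    # original codec token sequence.
    return groups.reshape(-1)[:original_len]
```

For example, a 750-token codec sequence with a group size of 2 becomes 375 modeling steps, which is where the inference-speed and long-sequence benefits come from.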