PromptTTS 2: Describing and Generating Voices with Text Prompt

Abstract

Speech conveys more information than text alone, since the same word can be uttered in different voices to convey different information. Traditional text-to-speech (TTS) methods rely on speech prompts (reference speech) for voice variability. Text prompts (descriptions) are more user-friendly, because a suitable speech prompt can be hard to find or may not exist at all. TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, where a text prompt cannot describe every detail of voice variability, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and a large data labeling cost. In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides the voice variability information not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about the voice) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech using a speech understanding model to recognize voice attributes (e.g., gender, speed) from the speech and a large language model to formulate a text prompt from the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous work, PromptTTS 2 generates voices more consistent with text prompts and supports sampling of diverse voice variability, thereby offering users more choices in voice generation. Additionally, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online at https://speechresearch.github.io/prompttts2.
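The two-stage prompt generation pipeline described above (recognize voice attributes, then formulate a text prompt from them) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `VoiceAttributes`, `recognize_attributes`, `template_writer`, and `generate_prompt` are hypothetical names, the threshold rules are toy placeholders for a real speech understanding model, and the fixed template is a placeholder for the large language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VoiceAttributes:
    """Voice attributes named in the abstract (gender, speed)."""
    gender: str
    speed: str


def recognize_attributes(features: Dict[str, float]) -> VoiceAttributes:
    """Hypothetical stand-in for a speech understanding model.

    The thresholds are illustrative assumptions only; a real system
    would use a trained classifier on the speech signal.
    """
    gender = "female" if features["mean_pitch_hz"] >= 165.0 else "male"
    speed = "fast" if features["words_per_second"] >= 3.0 else "slow"
    return VoiceAttributes(gender=gender, speed=speed)


def template_writer(attrs: VoiceAttributes) -> str:
    """Placeholder for the LLM step: turns attributes into a sentence."""
    return f"A {attrs.gender} voice speaking at a {attrs.speed} pace."


def generate_prompt(
    features: Dict[str, float],
    writer: Callable[[VoiceAttributes], str] = template_writer,
) -> str:
    """Stage 1: recognize attributes. Stage 2: write the text prompt."""
    return writer(recognize_attributes(features))


if __name__ == "__main__":
    print(generate_prompt({"mean_pitch_hz": 200.0, "words_per_second": 2.0}))
```

Passing the writer in as a callable keeps the attribute recognition step independent of how the prompt text is produced, so the template could be swapped for a call to a language model without changing the rest of the sketch.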
