PromptTTS 2: Describing and Generating Voices with Text Prompt

Abstract

Speech conveys more information than text alone, since the same word can be uttered in different voices to convey different information. Traditional text-to-speech (TTS) methods rely on speech prompts (reference speech) for voice variability; text prompts (descriptions) are more user-friendly, because a suitable speech prompt can be hard to find or may not exist at all. TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, where a text prompt cannot describe every detail of voice variability, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and costly data labeling. In this work, we introduce PromptTTS 2 to address these challenges with a variation network, which provides the voice variability information not captured by text prompts, and a prompt generation pipeline, which uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about the voice) from the text prompt representation. The prompt generation pipeline generates text prompts for speech using a speech understanding model, which recognizes voice attributes (e.g., gender, speed) from speech, and a large language model, which formulates a text prompt from the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared with previous work, PromptTTS 2 generates voices that are more consistent with text prompts and supports sampling of diverse voice variability, offering users more choices in voice generation. In addition, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online at https://speechresearch.github.io/prompttts2.
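
The two-stage prompt generation pipeline described in the abstract (recognize voice attributes, then have a language model phrase them) can be illustrated with a small sketch. This is not the authors' implementation: the `VoiceAttributes` record, the `bucket_speed` thresholds, the instruction wording, and the `template_llm` placeholder (a fixed template standing in for a real LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VoiceAttributes:
    """Hypothetical record of recognized attributes (the abstract names gender and speed)."""
    gender: str
    speed: str


def bucket_speed(words_per_second: float) -> str:
    """Toy stand-in for a speech understanding model; thresholds are made up."""
    if words_per_second < 2.0:
        return "slow"
    if words_per_second > 3.5:
        return "fast"
    return "normal"


def build_llm_instruction(attrs: VoiceAttributes) -> str:
    """Turn the recognition results into an instruction for a language model."""
    return (
        "Write one sentence describing a voice with these attributes: "
        f"gender={attrs.gender}, speed={attrs.speed}."
    )


def template_llm(instruction: str) -> str:
    """Placeholder for a real LLM call (assumption): parses the attributes
    back out of the instruction and fills a fixed template."""
    fields = dict(
        item.strip(" .").split("=")
        for item in instruction.split(":", 1)[1].split(",")
    )
    return f"A {fields['gender']} speaker talking at a {fields['speed']} pace."


def generate_text_prompt(
    gender: str,
    words_per_second: float,
    llm: Callable[[str], str] = template_llm,
) -> str:
    """Stage 1: recognize attributes. Stage 2: let the (stub) LLM phrase them."""
    attrs = VoiceAttributes(gender=gender, speed=bucket_speed(words_per_second))
    return llm(build_llm_instruction(attrs))


if __name__ == "__main__":
    print(generate_text_prompt("female", 4.0))
```

Passing a different callable as `llm` swaps the template for an actual model client without touching the recognition stage, which mirrors the separation between recognition and prompt writing that the abstract describes.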
