
UniAudio: An Audio Foundation Model Toward Universal Audio Generation

October 1, 2023
Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Helen Meng
cs.AI

Abstract

Language models (LMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system which, unlike prior task-specific approaches, leverages LM techniques to generate multiple types of audio (including speech, sounds, music, and singing) given input conditions. UniAudio 1) first tokenizes all types of target audio along with the other condition modalities, 2) concatenates each source-target pair into a single sequence, and 3) performs next-token prediction using LMs. In addition, a multi-scale Transformer model is proposed to handle the overly long sequences caused by the residual-vector-quantization-based neural codec used in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters across all generative tasks, aiming to obtain sufficient prior knowledge not only of the intrinsic properties of audio but also of the inter-relationships between audio and other modalities. As a result, the trained UniAudio model has the potential to become a foundation model for universal audio generation: it shows strong capability on all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio.
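
To make the tokenize-concatenate-predict recipe from the abstract concrete, here is a minimal sketch of building one training sequence and computing the next-token loss. Everything below is an illustrative assumption (the vocabulary size, special tokens, helper names, and the toy stand-in model); it is not the released UniAudio implementation, whose actual backbone is the multi-scale Transformer described in the paper.

```python
# Illustrative sketch only: hypothetical vocab/special tokens, and a toy
# stand-in model instead of UniAudio's multi-scale Transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024 + 3           # assumed codec codebook size plus 3 special tokens
BOS, SEP, EOS = 1024, 1025, 1026

def build_sequence(cond_tokens, target_tokens):
    """Concatenate a tokenized condition (e.g. text or phonemes) with the
    flattened RVQ codec tokens of the target audio into one sequence:
    [BOS] cond ... [SEP] target ... [EOS]."""
    return torch.tensor([BOS] + cond_tokens + [SEP] + target_tokens + [EOS])

class ToyCausalLM(nn.Module):
    """Left-to-right stand-in LM; the paper's multi-scale Transformer exists
    precisely because flattened RVQ sequences get very long."""
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # placeholder backbone
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)

# Next-token prediction: predict token t+1 from tokens up to t.
seq = build_sequence(cond_tokens=[5, 17, 42], target_tokens=[300, 301, 302, 303])
model = ToyCausalLM()
logits = model(seq[None, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[None, 1:].reshape(-1))
print(f"next-token loss: {loss.item():.3f}")
```

Note why the long-sequence problem arises: an RVQ codec emits several codebook tokens per audio frame, so flattening multiplies the sequence length by the number of quantization levels, which is what motivates the multi-scale Transformer.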