High-Fidelity Audio Compression with Improved RVQGAN

June 11, 2023
Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar
cs.AI

Abstract

Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high-quality neural compression model that can compress high-dimensional natural signals into lower-dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
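The ~90x figure follows directly from the bitrates in the abstract. A quick back-of-envelope check, assuming 16-bit mono PCM as the uncompressed baseline (the baseline bit depth and channel count are assumptions; the abstract states only the sample rate and target bitrate):

```python
# Sanity-check the ~90x compression claim from the abstract.
sample_rate = 44_100      # Hz, per the abstract
bit_depth = 16            # bits per sample (assumed PCM baseline)
codec_bitrate = 8_000     # bits per second (8 kbps, per the abstract)

pcm_bitrate = sample_rate * bit_depth        # uncompressed bitrate in bps
ratio = pcm_bitrate / codec_bitrate          # compression factor

print(f"PCM bitrate: {pcm_bitrate / 1000:.1f} kbps")  # 705.6 kbps
print(f"Compression: {ratio:.1f}x")                    # 88.2x, i.e. roughly 90x
```

At 705.6 kbps uncompressed, an 8 kbps token stream gives about an 88x reduction, consistent with the paper's "~90x".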
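The discrete tokens come from residual vector quantization (RVQ), the quantizer family the RVQGAN name refers to: each codebook quantizes the residual left over by the previous stage. A minimal NumPy sketch with toy codebooks (codebook shapes and contents here are illustrative only; the paper's quantizer uses learned codebooks and additional techniques not shown):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage picks the nearest codeword to the
    residual left by the previous stage, then subtracts it."""
    codes, residual = [], x
    for cb in codebooks:  # cb has shape (num_codewords, dim)
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

Each additional stage refines the reconstruction, so bitrate scales with the number of codebooks used, which is what makes RVQ tokens a convenient target for downstream generative models.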