High-Fidelity Audio Compression with Improved RVQGAN
June 11, 2023
Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar
cs.AI
Abstract
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality
neural compression model that can compress high-dimensional natural signals
into lower dimensional discrete tokens. To that end, we introduce a
high-fidelity universal neural audio compression algorithm that achieves ~90x
compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve
this by combining advances in high-fidelity audio generation with better vector
quantization techniques from the image domain, along with improved adversarial
and reconstruction losses. We compress all domains (speech, environment, music,
etc.) with a single universal model, making it widely applicable to generative
modeling of all audio. We compare with competing audio compression algorithms,
and find our method outperforms them significantly. We provide thorough
ablations for every design choice, as well as open-source code and trained
model weights. We hope our work can lay the foundation for the next generation
of high-fidelity audio modeling.
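The ~90x figure can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming 16-bit mono PCM as the uncompressed baseline (an assumption; the abstract states only the 44.1 kHz sample rate and 8 kbps codec bitrate):

```python
# Rough check of the ~90x compression claim from the abstract.
# Assumption: uncompressed audio is 16-bit mono PCM at 44.1 kHz.
sample_rate_hz = 44_100
bits_per_sample = 16

raw_kbps = sample_rate_hz * bits_per_sample / 1000  # 705.6 kbps uncompressed
codec_kbps = 8                                      # bitrate of the codec

ratio = raw_kbps / codec_kbps
print(f"{ratio:.1f}x")  # 88.2x, consistent with the ~90x reported
```

With a stereo or 24-bit baseline the ratio would be correspondingly higher, so the exact multiplier depends on the assumed uncompressed format.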