High-Fidelity Audio Compression with Improved RVQGAN

Abstract

Language models have been successfully used to model natural signals such as images, speech, and music. A key component of these models is a high-quality neural compression model that can compress high-dimensional natural signals into lower-dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at a bandwidth of just 8 kbps. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare against competing audio compression algorithms and find that our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
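The ~90x figure can be sanity-checked with simple bitrate arithmetic. The sketch below assumes the uncompressed reference is 16-bit mono PCM; the abstract does not state the bit depth or channel count, so both values are assumptions made here for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the "~90x compression" claim.
# ASSUMPTION (not stated in the abstract): the uncompressed baseline is
# 16-bit, single-channel PCM audio.

SAMPLE_RATE_HZ = 44_100      # from the abstract: 44.1 kHz audio
BITS_PER_SAMPLE = 16         # assumed bit depth
CHANNELS = 1                 # assumed mono
CODEC_BITRATE_BPS = 8_000    # from the abstract: 8 kbps


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(raw_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the raw stream."""
    return raw_bps / coded_bps


raw_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
ratio = compression_ratio(raw_bps, CODEC_BITRATE_BPS)

print(f"raw PCM bitrate:   {raw_bps / 1000:.1f} kbps")   # 705.6 kbps
print(f"compression ratio: {ratio:.1f}x")                # 88.2x
```

Under these assumptions the ratio comes out to about 88x, which is consistent with the rounded "~90x" in the abstract. A stereo or 24-bit baseline would give a proportionally larger ratio.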
