MusicHiFi: Fast High-Fidelity Stereo Vocoding

Abstract

Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples the result to high-resolution audio via bandwidth extension, and upmixes it to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective evaluation and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed than past work. Sound examples are at https://MusicHiFi.github.io/web/.
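The abstract does not say how the downmix-compatible property is obtained, so the following is only an illustrative sketch of one standard construction, mid/side encoding, and not necessarily the mechanism used in MusicHiFi. The idea is that if a model predicts only a side signal and the two channels are formed as mono plus side and mono minus side, then averaging the channels returns the mono input. The function names `upmix_mid_side` and `downmix` are hypothetical, and a random signal stands in for the output of a trained generator.

```python
import random


def upmix_mid_side(mono, side):
    """Form left/right channels from a mono (mid) signal and a side signal.

    Assumption: the side signal would come from a learned generator;
    here it is simply passed in by the caller.
    """
    if len(mono) != len(side):
        raise ValueError("mono and side must have the same length")
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right


def downmix(left, right):
    """Average the two channels back to a single mono signal."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


random.seed(0)
mono = [random.uniform(-1.0, 1.0) for _ in range(1000)]
# Stand-in for a generator's predicted side channel (hypothetical).
side = [0.3 * random.uniform(-1.0, 1.0) for _ in range(1000)]

left, right = upmix_mid_side(mono, side)
recovered = downmix(left, right)

# The side terms cancel in the average, so the mono content is preserved
# up to floating-point rounding regardless of what the side signal is.
max_error = max(abs(a - b) for a, b in zip(recovered, mono))
print("max downmix error:", max_error)
```

Under this construction the preservation of mono content holds by design rather than by training, which is one reason such a parameterization can be attractive for an upmixer; scaling the side signal would also give a simple handle on stereo width.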
