MusicHiFi: Fast High-Fidelity Stereo Vocoding
March 15, 2024
作者: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan
cs.AI
Abstract
Diffusion-based audio and music generation models commonly generate music by
constructing an image representation of audio (e.g., a mel-spectrogram) and
then converting it to audio using a phase reconstruction model or vocoder.
Typical vocoders, however, produce monophonic audio at lower resolutions (e.g.,
16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an
efficient high-fidelity stereophonic vocoder. Our method employs a cascade of
three generative adversarial networks (GANs) that converts low-resolution
mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth
expansion, and upmixes to stereophonic audio. Compared to previous work, we
propose 1) a unified GAN-based generator and discriminator architecture and
training procedure for each stage of our cascade, 2) a new fast, near
downsampling-compatible bandwidth extension module, and 3) a new fast
downmix-compatible mono-to-stereo upmixer that ensures the preservation of
monophonic content in the output. We evaluate our approach using both objective
and subjective listening tests and find our approach yields comparable or
better audio quality, better spatialization control, and significantly faster
inference speed compared to past work. Sound examples are at
https://MusicHiFi.github.io/web/.
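The two compatibility properties named in the abstract can be made concrete with a small numerical sketch. This is illustrative only, not the paper's actual architecture: the mid/side construction for the upmixer and the spectral-masking construction for the bandwidth extender are assumptions consistent with the stated constraints (downmixing recovers the mono input; downsampling recovers the low-resolution input), and all function names are hypothetical.

```python
import numpy as np

# Hedged sketch (not the authors' code): toy versions of the two
# compatibility constraints described in the abstract.

# 1) Downmix-compatible upmixing: if the mono-to-stereo stage predicts
#    only a "side" signal s and forms L = m + s, R = m - s, then the
#    standard downmix (L + R) / 2 recovers the mono input m exactly.

def upmix(mono, side):
    """Build left/right channels from mono and a predicted side signal."""
    return mono + side, mono - side

def downmix(left, right):
    """Standard stereo-to-mono downmix."""
    return 0.5 * (left + right)

rng = np.random.default_rng(0)
m = rng.standard_normal(1024)          # mono signal
s = 0.3 * rng.standard_normal(1024)    # stand-in for a predicted side signal
L, R = upmix(m, s)
assert np.allclose(downmix(L, R), m)   # mono content is preserved exactly

# 2) Downsampling-compatible bandwidth extension: if the extension stage
#    adds content only above the input's Nyquist frequency, an ideal
#    low-pass downsampling of the output recovers the input.

def fft_upsample(x, up=2):
    """Band-limited upsampling by zero-padding the spectrum."""
    spec = np.fft.rfft(x)
    spec_up = np.zeros(len(x) * up // 2 + 1, dtype=complex)
    spec_up[: len(spec)] = spec
    return np.fft.irfft(spec_up, len(x) * up) * up

def fft_downsample(x, down=2):
    """Ideal low-pass downsampling by truncating the spectrum."""
    n = len(x) // down
    return np.fft.irfft(np.fft.rfft(x)[: n // 2 + 1], n) / down

x_low = rng.standard_normal(1024)                 # low-resolution input
high_spec = np.fft.rfft(rng.standard_normal(2048))
high_spec[: 1024 // 2 + 1] = 0.0                  # zero the low band
y = fft_upsample(x_low) + np.fft.irfft(high_spec, 2048)
assert np.allclose(fft_downsample(y), x_low)      # low band preserved
```

In both cases the constraint is enforced by construction rather than learned: the upmixer can only redistribute energy between channels, and the bandwidth extender can only add energy above the original Nyquist frequency, which is one way to guarantee the "downmix-compatible" and "near downsampling-compatible" behavior the abstract describes.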