MusicHiFi: Fast High-Fidelity Stereo Vocoding
March 15, 2024
作者: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan
cs.AI
Abstract
Diffusion-based audio and music generation models commonly generate music by
constructing an image representation of audio (e.g., a mel-spectrogram) and
then converting it to audio using a phase reconstruction model or vocoder.
Typical vocoders, however, produce monophonic audio at lower resolutions (e.g.,
16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an
efficient high-fidelity stereophonic vocoder. Our method employs a cascade of
three generative adversarial networks (GANs) that converts low-resolution
mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth
expansion, and upmixes to stereophonic audio. Compared to previous work, we
propose 1) a unified GAN-based generator and discriminator architecture and
training procedure for each stage of our cascade, 2) a new fast, near
downsampling-compatible bandwidth extension module, and 3) a new fast
downmix-compatible mono-to-stereo upmixer that ensures the preservation of
monophonic content in the output. We evaluate our approach using both objective
and subjective listening tests and find our approach yields comparable or
better audio quality, better spatialization control, and significantly faster
inference speed compared to past work. Sound examples are at
https://MusicHiFi.github.io/web/.
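The two compatibility properties named in the abstract can be made concrete with a small numerical sketch. This is illustrative only, not the paper's actual architecture: the mid/side construction for the upmixer and the spectral-masking construction for the bandwidth extender are assumptions consistent with the stated constraints (downmixing recovers the mono input; downsampling recovers the low-resolution input), and all function names are hypothetical.

```python
import numpy as np

# Hedged sketch (not the authors' code): toy versions of the two
# compatibility constraints described in the abstract.

# 1) Downmix-compatible upmixing: if the mono-to-stereo stage predicts
#    only a "side" signal s and forms L = m + s, R = m - s, then the
#    standard downmix (L + R) / 2 recovers the mono input m exactly.

def upmix(mono, side):
    """Build left/right channels from mono and a predicted side signal."""
    return mono + side, mono - side

def downmix(left, right):
    """Standard stereo-to-mono downmix."""
    return 0.5 * (left + right)

rng = np.random.default_rng(0)
m = rng.standard_normal(1024)          # mono signal
s = 0.3 * rng.standard_normal(1024)    # stand-in for a predicted side signal
L, R = upmix(m, s)
assert np.allclose(downmix(L, R), m)   # mono content is preserved exactly

# 2) Downsampling-compatible bandwidth extension: if the extension stage
#    adds content only above the input's Nyquist frequency, an ideal
#    low-pass downsampling of the output recovers the input.

def fft_upsample(x, up=2):
    """Band-limited upsampling by zero-padding the spectrum."""
    spec = np.fft.rfft(x)
    spec_up = np.zeros(len(x) * up // 2 + 1, dtype=complex)
    spec_up[: len(spec)] = spec
    return np.fft.irfft(spec_up, len(x) * up) * up

def fft_downsample(x, down=2):
    """Ideal low-pass downsampling by truncating the spectrum."""
    n = len(x) // down
    return np.fft.irfft(np.fft.rfft(x)[: n // 2 + 1], n) / down

x_low = rng.standard_normal(1024)                 # low-resolution input
high_spec = np.fft.rfft(rng.standard_normal(2048))
high_spec[: 1024 // 2 + 1] = 0.0                  # zero the low band
y = fft_upsample(x_low) + np.fft.irfft(high_spec, 2048)
assert np.allclose(fft_downsample(y), x_low)      # low band preserved
```

In both cases the constraint is enforced by construction rather than learned: the upmixer can only redistribute energy between channels, and the bandwidth extender can only add energy above the original Nyquist frequency, which is one way to guarantee the "downmix-compatible" and "near downsampling-compatible" behavior the abstract describes.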