

MusicHiFi: Fast High-Fidelity Stereo Vocoding

March 15, 2024
Authors: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan
cs.AI

Abstract

Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth extension, and upmix to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.
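
The abstract's two key structural claims, a three-stage GAN cascade and a downmix-compatible mono-to-stereo upmixer, can be made concrete with a short sketch. The following minimal Python example is not the authors' code: `mel_to_audio`, `bandwidth_extend`, and `predict_side` are hypothetical toy stand-ins for the three GANs. Only the mid/side construction at the end reflects a property stated in the abstract, namely that averaging the two output channels recovers the mono input exactly.

```python
import numpy as np

def mel_to_audio(mel: np.ndarray) -> np.ndarray:
    # Toy stand-in for the stage-1 GAN vocoder (mel-spectrogram -> low-res mono):
    # here we just average the mel bins per frame to get a 1-D signal.
    return mel.mean(axis=0)

def bandwidth_extend(audio: np.ndarray, factor: int = 2) -> np.ndarray:
    # Toy stand-in for the stage-2 bandwidth-extension GAN: naive
    # linear-interpolation upsampling to a higher sample rate. The paper's
    # module is "near downsampling-compatible": downsampling its output
    # approximately recovers the low-resolution input.
    n = len(audio)
    return np.interp(np.linspace(0, n - 1, n * factor), np.arange(n), audio)

def predict_side(mono: np.ndarray) -> np.ndarray:
    # Toy stand-in for the stage-3 side-channel predictor (a GAN in the paper):
    # a short attenuated delay, just to make the sketch runnable.
    return 0.1 * np.roll(mono, 8)

def musichifi_cascade(mel: np.ndarray):
    mono_lo = mel_to_audio(mel)          # stage 1: mel -> low-resolution mono
    mono_hi = bandwidth_extend(mono_lo)  # stage 2: low-res -> high-res mono
    side = predict_side(mono_hi)         # stage 3: predict only a side channel
    # Mid/side construction: since only a side signal is predicted,
    # (left + right) / 2 == mono_hi holds by construction.
    left, right = mono_hi + side, mono_hi - side
    return left, right

if __name__ == "__main__":
    mel = np.abs(np.random.randn(80, 256))  # (mel bins, frames)
    left, right = musichifi_cascade(mel)
    # Downmix compatibility: averaging the channels recovers the mono signal.
    assert np.allclose((left + right) / 2, bandwidth_extend(mel_to_audio(mel)))
```

Note the design choice this illustrates: because the upmixer outputs only a side channel rather than two free channels, preservation of monophonic content is guaranteed by construction instead of being learned.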
