HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
January 17, 2025
Authors: Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma
cs.AI
Abstract
The application of generative adversarial networks (GANs) has recently
advanced speech super-resolution (SR) based on intermediate representations
like mel-spectrograms. However, existing SR methods typically rely on
independently trained, concatenated networks, which can lead to inconsistent
representations and poor speech quality, especially in out-of-domain scenarios.
In this work, we propose HiFi-SR, a unified network that leverages end-to-end
adversarial training to achieve high-fidelity speech super-resolution. Our
model features a unified transformer-convolutional generator designed to
seamlessly handle both the prediction of latent representations and their
conversion into time-domain waveforms. The transformer network serves as a
powerful encoder, converting low-resolution mel-spectrograms into latent space
representations, while the convolutional network upscales these representations
into high-resolution waveforms. To enhance high-frequency fidelity, we
incorporate a multi-band, multi-scale time-frequency discriminator, along with
a multi-scale mel-reconstruction loss in the adversarial training process.
HiFi-SR is versatile, capable of upscaling any input speech signal between 4
kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that
HiFi-SR significantly outperforms existing speech SR methods across both
objective metrics and ABX preference tests, for both in-domain and
out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
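The abstract states that the convolutional network upscales latent representations into 48 kHz waveforms. A rough illustration of the arithmetic involved (this is a sketch under assumptions, not the authors' code): in HiFi-GAN-style decoders, the product of the per-layer transposed-convolution strides must equal the number of waveform samples covered by one mel frame at the target rate. The hop length and stride schedule below are illustrative choices, not values from the paper.

```python
# Sketch (assumed hyperparameters): verify that a stack of transposed-conv
# strides realizes exactly the upsampling factor needed to map mel frames
# to 48 kHz audio, i.e. the hop length in samples at the target rate.

TARGET_SR = 48_000  # output sampling rate stated in the paper


def required_upsample_factor(hop_length: int) -> int:
    """Each mel frame advances by `hop_length` samples at the target rate,
    so the decoder must upsample by exactly that factor."""
    return hop_length


def check_upsample_schedule(strides, hop_length: int) -> bool:
    """True if the per-layer strides multiply to the required factor."""
    total = 1
    for s in strides:
        total *= s
    return total == required_upsample_factor(hop_length)


# e.g. an assumed hop of 256 samples can be realized with strides 8*8*2*2
print(check_upsample_schedule([8, 8, 2, 2], 256))   # multiplies to 256
print(check_upsample_schedule([8, 8, 4, 2], 256))   # multiplies to 512
```

Any factorization with the same product works; in practice the larger strides are usually placed first so early layers operate on shorter sequences.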
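The multi-scale mel-reconstruction loss mentioned above compares spectra of the generated and reference waveforms at several time-frequency resolutions. As a minimal sketch of the idea (an assumption, not the paper's exact formulation): the version below uses linear-frequency magnitude spectra instead of mel filterbank outputs, and the FFT/hop sizes are illustrative.

```python
import numpy as np


def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude spectrogram via framed, Hann-windowed real FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))


def multi_scale_spectral_loss(pred: np.ndarray, target: np.ndarray,
                              scales=((512, 128), (1024, 256), (2048, 512))):
    """Mean L1 distance between magnitude spectra at several resolutions.

    Small FFTs capture fine temporal detail; large FFTs capture fine
    frequency detail, so averaging over scales penalizes artifacts in both.
    """
    losses = []
    for n_fft, hop in scales:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        losses.append(np.mean(np.abs(p - t)))
    return float(np.mean(losses))
```

A perfect reconstruction yields zero loss at every scale; the paper's variant would additionally project each magnitude spectrogram onto mel filterbanks before taking the distance.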