HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

Abstract

Generative adversarial networks (GANs) have recently advanced speech super-resolution (SR) based on intermediate representations such as mel-spectrograms. However, existing SR methods typically rely on independently trained, concatenated networks, which can lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as an encoder that converts low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss, in the adversarial training process. HiFi-SR is versatile: it can upscale any input speech signal with a sampling rate between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods on both objective metrics and ABX preference tests, in both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
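To make the stated input range concrete, the following minimal sketch computes how much of the output band a super-resolution model has to generate for a given input rate. It relies only on the Nyquist relation (a signal sampled at rate f carries content up to f/2) and on the 48 kHz target stated in the abstract. The function name `bandwidth_gap` and the four example input rates are illustrative assumptions, not part of HiFi-SR or its code base.

```python
from fractions import Fraction

TARGET_RATE_HZ = 48_000  # output sampling rate stated in the abstract


def bandwidth_gap(input_rate_hz, target_rate_hz=TARGET_RATE_HZ):
    """Hypothetical helper: describe the band a model must synthesize.

    Returns (upsampling ratio, input Nyquist in Hz,
             target Nyquist in Hz, missing bandwidth in Hz).
    """
    if not 0 < input_rate_hz <= target_rate_hz:
        raise ValueError("input rate must be positive and not exceed the target rate")
    ratio = Fraction(target_rate_hz, input_rate_hz)  # exact, e.g. 3/2 for 32 kHz
    input_nyquist = input_rate_hz / 2    # highest frequency present in the input
    target_nyquist = target_rate_hz / 2  # highest frequency representable in the output
    return ratio, input_nyquist, target_nyquist, target_nyquist - input_nyquist


if __name__ == "__main__":
    # Example rates chosen for illustration; the abstract only gives the 4-32 kHz range.
    for rate in (4_000, 8_000, 16_000, 32_000):
        ratio, in_nyq, out_nyq, missing = bandwidth_gap(rate)
        print(f"{rate} Hz -> {TARGET_RATE_HZ} Hz: ratio {ratio}, "
              f"content up to {in_nyq:.0f} Hz, missing {missing:.0f} Hz of {out_nyq:.0f} Hz")
```

At the low end of the range, almost the entire output band has to be generated rather than copied from the input, which is consistent with the abstract's emphasis on high-frequency fidelity.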
