EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks
January 31, 2024
Authors: Shijia Liao, Shiyi Lan, Arun George Zachariah
cs.AI
Abstract
The advent of Large Models marks a new era in machine learning, significantly
outperforming smaller models by leveraging vast datasets to capture and
synthesize complex patterns. Despite these advancements, the exploration of
scaling, especially in the audio generation domain, remains limited: previous
efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain and
suffered from both spectral discontinuities and blurriness in the
high-frequency domain, alongside a lack of robustness against out-of-domain
data. These limitations restrict the applicability of models to diverse use
cases, including music and singing generation. Our work introduces Enhanced
Various Audio Generation via Scalable Generative Adversarial Networks
(EVA-GAN), which yields significant improvements over the previous state of
the art in spectral and high-frequency reconstruction and in robustness on
out-of-domain data, enabling the generation of HiFi audio by employing an
extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, a
Human-In-The-Loop artifact measurement toolkit, and a model expanded to
approximately 200 million parameters. Demonstrations of our work are available
at https://double-blind-eva-gan.cc.