EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

Abstract

The advent of large models marks a new era in machine learning: by leveraging vast datasets to capture and synthesize complex patterns, they significantly outperform smaller models. Despite these advances, scaling remains underexplored, especially in audio generation. Previous efforts have not extended to the high-fidelity (HiFi) 44.1kHz domain, suffer from spectral discontinuities and blurriness in the high-frequency range, and lack robustness to out-of-domain data. These limitations restrict the applicability of such models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and in robustness to out-of-domain data. It enables HiFi audio generation by employing an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and by scaling the model to approximately 200 million parameters. Demonstrations of our work are available at https://double-blind-eva-gan.cc.