Stable Audio Open
July 19, 2024
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
cs.AI
Abstract
Open generative models are vitally important for the community, allowing for
fine-tunes and serving as baselines when presenting new models. However, most
current text-to-audio models are private and not accessible for artists and
researchers to build upon. Here we describe the architecture and training
process of a new open-weights text-to-audio model trained with Creative Commons
data. Our evaluation shows that the model's performance is competitive with the
state-of-the-art across various metrics. Notably, the reported FDopenl3 results
(measuring the realism of the generations) showcase its potential for
high-quality stereo sound synthesis at 44.1kHz.
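For readers unfamiliar with the FDopenl3 metric named in the abstract: it is a Fréchet distance computed between embeddings of real and generated audio, where the embeddings come from the OpenL3 model. The sketch below illustrates only the distance itself, under the usual assumption that each embedding set is summarized by a Gaussian (mean and covariance). The function name `frechet_distance` and the random placeholder arrays are hypothetical; extracting OpenL3 embeddings is omitted, and this is not the authors' evaluation code.

```python
import numpy as np


def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_samples, dim). In an FD-style audio metric these
    would be embeddings of real and generated clips (assumed precomputed).
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(emb_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(emb_gen, rowvar=False))

    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in exact
    # arithmetic; clip to guard against small negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Placeholder data standing in for real embeddings (hypothetical shapes).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
generated = rng.normal(loc=0.5, size=(500, 8))
print(frechet_distance(real, generated))
```

A lower value means the generated embeddings are statistically closer to the real ones; identical sets give a distance of approximately zero.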