ChatPaper.aiChatPaper

穩定音訊開放

Stable Audio Open

July 19, 2024
作者: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
cs.AI

摘要

開放式生成模型對AI社群至關重要,可進行微調並在提出新模型時作為基準。然而,大多數目前的文本轉語音模型是私有的,並不對藝術家和研究人員開放以進行擴展。在這裡,我們描述了一個新的開放權重文本轉語音模型的架構和訓練過程,該模型是使用創用CC授權數據進行訓練的。我們的評估顯示,該模型在各種指標上的表現與最先進的模型競爭力相當。值得注意的是,報告的FDopenl3結果(用於衡量生成物的真實性)展示了其在44.1kHz下進行高質量立體聲音頻合成的潛力。
English
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.

Summary

AI-Generated Summary

PDF275November 28, 2024