穩定音訊開放

摘要

開放式生成模型對AI社群至關重要，可進行微調並在提出新模型時作為基準。然而，大多數目前的文本轉語音模型是私有的，並不對藝術家和研究人員開放以進行擴展。在這裡，我們描述了一個新的開放權重文本轉語音模型的架構和訓練過程，該模型是使用創用CC授權數據進行訓練的。我們的評估顯示，該模型在各種指標上的表現與最先進的模型競爭力相當。值得注意的是，報告的FDopenl3結果（用於衡量生成物的真實性）展示了其在44.1kHz下進行高質量立體聲音頻合成的潛力。

English

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.

穩定音訊開放

Stable Audio Open

摘要

Support