MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
October 1, 2024
Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
cs.AI
Abstract
The rise of foundation models (FMs), coupled with regulatory efforts
addressing their risks and impacts, has sparked significant interest in
open-source models. However, existing speech FMs (SFMs) fall short of full
compliance with the open-source principles, even if claimed otherwise, as no
existing SFM has model weights, code, and training data publicly available
under open-source terms. In this work, we take the first step toward filling
this gap by focusing on the 24 official languages of the European Union (EU).
We collect suitable training data by surveying automatic speech recognition
datasets and unlabeled speech corpora under open-source compliant licenses, for
a total of 950k hours. Additionally, we release automatic transcripts for 441k
hours of unlabeled data under the permissive CC-BY license, thereby
facilitating the creation of open-source SFMs for the EU languages.