MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Abstract

The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.