MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Abstract

The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with open-source principles, even when they claim otherwise: no existing SFM has its model weights, code, and training data all publicly available under open-source terms. In this work, we take a first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora released under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.