Towards Robust Speech Representation Learning for Thousands of Languages
June 30, 2024
Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe
cs.AI
Abstract
Self-supervised learning (SSL) has helped extend speech technologies to more
languages by reducing the need for labeled data. However, models are still far
from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual
Encoder for Universal Speech, trained on over 1 million hours of data across
4057 languages, extending the language coverage of SSL models 4-fold. We
combine 1 million hours of speech from existing publicly accessible corpora
with a newly created corpus of 7400+ hours from 4057 languages, which will be
publicly released. To handle the diverse conditions of multilingual speech
data, we augment the typical SSL masked prediction approach with a novel
dereverberation objective, increasing robustness. We evaluate XEUS on several
benchmarks, and show that it consistently outperforms or achieves comparable
results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS
sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT
2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less
pre-training data. Checkpoints, code, and data are available at
https://www.wavlab.org/activities/2024/xeus/.
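
To make the combined training objective more concrete, below is a minimal PyTorch-style sketch of how a masked-prediction loss over discrete targets could be paired with an auxiliary dereverberation term. This is not the authors' implementation: the module and parameter names (JointSSLObjective, cluster_head, derev_head, alpha), the tensor shapes, and the choice of an L1 feature-regression loss for dereverberation are all illustrative assumptions.

```python
# Hypothetical sketch of a joint masked-prediction + dereverberation SSL loss.
# Names, shapes, and the loss weighting are illustrative assumptions, not the
# paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointSSLObjective(nn.Module):
    def __init__(self, feat_dim: int = 768, num_clusters: int = 500, alpha: float = 0.1):
        super().__init__()
        # Head for masked prediction over discrete cluster pseudo-labels.
        self.cluster_head = nn.Linear(feat_dim, num_clusters)
        # Head that regresses features of the clean (dry) utterance, so the
        # encoder is pushed to implicitly dereverberate its input.
        self.derev_head = nn.Linear(feat_dim, feat_dim)
        self.alpha = alpha  # weight of the auxiliary dereverberation loss (assumed)

    def forward(self, hidden, mask, cluster_targets, clean_feats):
        # hidden:          (B, T, D) encoder outputs for the reverberant input
        # mask:            (B, T)    boolean, True at masked frames
        # cluster_targets: (B, T)    discrete pseudo-labels
        # clean_feats:     (B, T, D) target features from the dry audio
        logits = self.cluster_head(hidden[mask])             # predict only at masked frames
        mp_loss = F.cross_entropy(logits, cluster_targets[mask])
        derev_loss = F.l1_loss(self.derev_head(hidden), clean_feats)
        return mp_loss + self.alpha * derev_loss


# Toy usage with random tensors, just to show the expected shapes.
B, T, D, K = 2, 50, 768, 500
objective = JointSSLObjective(feat_dim=D, num_clusters=K)
hidden = torch.randn(B, T, D)
mask = torch.rand(B, T) > 0.5
targets = torch.randint(0, K, (B, T))
clean = torch.randn(B, T, D)
loss = objective(hidden, mask, targets, clean)
loss.backward()
```

In this sketch the masked-prediction term is computed only at masked frames, while the dereverberation term is applied to all frames; whether and how the two terms are weighted or scheduled in XEUS itself is not specified in the abstract.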