Towards Robust Speech Representation Learning for Thousands of Languages
June 30, 2024
作者: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe
cs.AI
Abstract
Self-supervised learning (SSL) has helped extend speech technologies to more
languages by reducing the need for labeled data. However, models are still far
from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual
Encoder for Universal Speech, trained on over 1 million hours of data across
4057 languages, extending the language coverage of SSL models 4-fold. We
combine 1 million hours of speech from existing publicly accessible corpora
with a newly created corpus of 7400+ hours from 4057 languages, which will be
publicly released. To handle the diverse conditions of multilingual speech
data, we augment the typical SSL masked prediction approach with a novel
dereverberation objective, increasing robustness. We evaluate XEUS on several
benchmarks, and show that it consistently outperforms or achieves comparable
results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS
sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT
2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less
pre-training data. Checkpoints, code, and data are available at
https://www.wavlab.org/activities/2024/xeus/.
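
As a rough illustration of the training objective the abstract describes, the sketch below combines a masked-prediction loss over discrete targets with an auxiliary loss that reconstructs clean (dry) features from reverberant input. This is a minimal toy example, not the XEUS implementation: the module names, dimensions, masking rate, and loss weighting are all placeholder assumptions.

```python
# Minimal sketch: masked prediction (HuBERT-style discrete targets) combined with a
# dereverberation (clean-feature reconstruction) objective. All hyperparameters and
# module choices here are illustrative assumptions, not values from the XEUS paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySSLModel(nn.Module):
    """Toy encoder with two heads: discrete-unit prediction and clean-feature reconstruction."""

    def __init__(self, feat_dim=80, hidden_dim=256, num_units=500):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mask_embedding = nn.Parameter(torch.zeros(hidden_dim))
        self.unit_head = nn.Linear(hidden_dim, num_units)   # masked prediction of discrete pseudo-labels
        self.derev_head = nn.Linear(hidden_dim, feat_dim)    # predicts the clean (dry) features

    def forward(self, reverberant_feats, mask):
        # reverberant_feats: (B, T, feat_dim); mask: (B, T) boolean, True = masked frame
        x = self.input_proj(reverberant_feats)
        x = torch.where(mask.unsqueeze(-1), self.mask_embedding.expand_as(x), x)
        h = self.encoder(x)
        return self.unit_head(h), self.derev_head(h)


def combined_loss(unit_logits, derev_pred, target_units, clean_feats, mask, derev_weight=0.5):
    """Cross-entropy on masked frames plus an L1 dereverberation loss on all frames.

    The 0.5 weighting is an arbitrary placeholder, not a value from the paper.
    """
    ce = F.cross_entropy(unit_logits[mask], target_units[mask])
    derev = F.l1_loss(derev_pred, clean_feats)
    return ce + derev_weight * derev


if __name__ == "__main__":
    B, T, D, U = 2, 100, 80, 500
    model = ToySSLModel(feat_dim=D, num_units=U)
    clean = torch.randn(B, T, D)                  # stand-in for clean features
    reverb = clean + 0.3 * torch.randn(B, T, D)   # stand-in for reverberant/noisy features
    units = torch.randint(0, U, (B, T))           # stand-in for discrete pseudo-labels
    mask = torch.rand(B, T) < 0.3                 # randomly mask ~30% of frames
    logits, derev_pred = model(reverb, mask)
    loss = combined_loss(logits, derev_pred, units, clean, mask)
    loss.backward()
    print(f"combined loss: {loss.item():.3f}")
```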