OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
May 31, 2025
Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
cs.AI
Abstract
The Open Whisper-style Speech Models (OWSM) project has developed a series of
fully open speech foundation models using academic-scale resources, but their
training data remains insufficient. This work enhances OWSM by integrating
YODAS, a large-scale web-crawled dataset with a Creative Commons license.
However, incorporating YODAS is nontrivial due to its wild nature, which
introduces challenges such as incorrect language labels and audio-text
misalignments. To address this, we develop a scalable data-cleaning pipeline
using public toolkits, yielding a dataset with 166,000 hours of speech across
75 languages. Our new series of OWSM v4 models, trained on this curated dataset
alongside existing OWSM data, significantly outperform previous versions on
multilingual benchmarks. Our models even match or surpass frontier industrial
models like Whisper and MMS in multiple scenarios. We will publicly release the
cleaned YODAS data, pre-trained models, and all associated scripts via the
ESPnet toolkit.
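The abstract names two failure modes of raw web-crawled speech data: incorrect language labels and audio-text misalignment. A minimal sketch of the kind of per-utterance filtering such a cleaning pipeline might apply is shown below; all field names, models, and thresholds here are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_id: str
    claimed_lang: str      # language label from the crawl metadata
    detected_lang: str     # label predicted by a language-ID model
    lid_confidence: float  # language-ID confidence in [0, 1]
    align_score: float     # audio-text alignment score (higher = better)

def keep(utt: Utterance,
         lid_threshold: float = 0.9,
         align_threshold: float = -5.0) -> bool:
    """Keep an utterance only if the detected language matches the
    claimed label with high confidence and the alignment score suggests
    the transcript actually matches the audio. Thresholds are
    hypothetical and would be tuned per language in practice."""
    if utt.detected_lang != utt.claimed_lang:
        return False  # wrong language label
    if utt.lid_confidence < lid_threshold:
        return False  # language ID too uncertain
    return utt.align_score >= align_threshold  # drop misaligned pairs

corpus = [
    Utterance("a", "en", "en", 0.98, -1.2),   # clean
    Utterance("b", "en", "es", 0.95, -1.0),   # mislabeled language
    Utterance("c", "de", "de", 0.99, -12.0),  # misaligned transcript
]
cleaned = [u for u in corpus if keep(u)]
print([u.audio_id for u in cleaned])  # → ['a']
```

In a real pipeline the detected language and alignment score would come from public toolkits (e.g. a language-ID model and a CTC-based forced aligner), with the surviving utterances aggregated into the final multilingual training set.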