OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
May 31, 2025
Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
cs.AI
Abstract
The Open Whisper-style Speech Models (OWSM) project has developed a series of
fully open speech foundation models using academic-scale resources, but their
training data remains insufficient. This work enhances OWSM by integrating
YODAS, a large-scale web-crawled dataset with a Creative Commons license.
However, incorporating YODAS is nontrivial due to its wild nature, which
introduces challenges such as incorrect language labels and audio-text
misalignments. To address this, we develop a scalable data-cleaning pipeline
using public toolkits, yielding a dataset with 166,000 hours of speech across
75 languages. Our new series of OWSM v4 models, trained on this curated dataset
alongside existing OWSM data, significantly outperform previous versions on
multilingual benchmarks. Our models even match or surpass frontier industrial
models like Whisper and MMS in multiple scenarios. We will publicly release the
cleaned YODAS data, pre-trained models, and all associated scripts via the
ESPnet toolkit.
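The abstract names two failure modes of raw web-crawled speech data: incorrect language labels and audio-text misalignment. A minimal sketch of the kind of per-utterance filtering such a cleaning pipeline might apply is shown below; all field names, models, and thresholds here are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_id: str
    claimed_lang: str      # language label from the crawl metadata
    detected_lang: str     # label predicted by a language-ID model
    lid_confidence: float  # language-ID confidence in [0, 1]
    align_score: float     # audio-text alignment score (higher = better)

def keep(utt: Utterance,
         lid_threshold: float = 0.9,
         align_threshold: float = -5.0) -> bool:
    """Keep an utterance only if the detected language matches the
    claimed label with high confidence and the alignment score suggests
    the transcript actually matches the audio. Thresholds are
    hypothetical and would be tuned per language in practice."""
    if utt.detected_lang != utt.claimed_lang:
        return False  # wrong language label
    if utt.lid_confidence < lid_threshold:
        return False  # language ID too uncertain
    return utt.align_score >= align_threshold  # drop misaligned pairs

corpus = [
    Utterance("a", "en", "en", 0.98, -1.2),   # clean
    Utterance("b", "en", "es", 0.95, -1.0),   # mislabeled language
    Utterance("c", "de", "de", 0.99, -12.0),  # misaligned transcript
]
cleaned = [u for u in corpus if keep(u)]
print([u.audio_id for u in cleaned])  # → ['a']
```

In a real pipeline the detected language and alignment score would come from public toolkits (e.g. a language-ID model and a CTC-based forced aligner), with the surviving utterances aggregated into the final multilingual training set.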