OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
May 31, 2025
Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
cs.AI
Abstract
The Open Whisper-style Speech Models (OWSM) project has developed a series of
fully open speech foundation models using academic-scale resources, but their
training data remains insufficient. This work enhances OWSM by integrating
YODAS, a large-scale web-crawled dataset with a Creative Commons license.
However, incorporating YODAS is nontrivial due to its wild nature, which
introduces challenges such as incorrect language labels and audio-text
misalignments. To address this, we develop a scalable data-cleaning pipeline
using public toolkits, yielding a dataset with 166,000 hours of speech across
75 languages. Our new series of OWSM v4 models, trained on this curated dataset
alongside existing OWSM data, significantly outperform previous versions on
multilingual benchmarks. Our models even match or surpass frontier industrial
models like Whisper and MMS in multiple scenarios. We will publicly release the
cleaned YODAS data, pre-trained models, and all associated scripts via the
ESPnet toolkit.
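The abstract describes two cleaning criteria for the raw YODAS data: rejecting utterances whose language label is wrong and rejecting utterances whose transcript does not align with the audio. The following is a minimal, hypothetical sketch of such a filter, not the authors' actual pipeline; `detect_language`, `alignment_score`, and the 0.5 threshold are placeholders for whichever public language-ID and alignment toolkits one plugs in.

```python
# Hypothetical sketch of the two filtering criteria mentioned in the abstract:
# (1) drop utterances whose detected language disagrees with the YODAS label,
# (2) drop utterances whose audio-text alignment score is below a threshold.
# `detect_language` and `alignment_score` are placeholders, not ESPnet APIs.

from dataclasses import dataclass


@dataclass
class Utterance:
    audio_path: str
    text: str
    language: str  # language label from the raw YODAS metadata


def detect_language(audio_path: str) -> str:
    """Placeholder: run a public language-ID model and return an ISO code."""
    raise NotImplementedError


def alignment_score(audio_path: str, text: str) -> float:
    """Placeholder: score how well the transcript aligns with the audio."""
    raise NotImplementedError


def clean(utterances, min_align_score: float = 0.5):
    """Keep only utterances that pass both checks."""
    kept = []
    for utt in utterances:
        if detect_language(utt.audio_path) != utt.language:
            continue  # language label is likely incorrect
        if alignment_score(utt.audio_path, utt.text) < min_align_score:
            continue  # transcript does not match the audio well enough
        kept.append(utt)
    return kept
```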