
FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

May 28, 2025
作者: Sara Papi, Marco Gaido, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
cs.AI

Abstract

The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature--with inaccessible training data and code--poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
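Since the models, data, and code are released under open licenses, one straightforward way to try an open SFM of this kind is through the Hugging Face `transformers` automatic-speech-recognition pipeline. The sketch below is illustrative only: the model identifier `FBK-MT/fama-small` and the audio file name are assumptions, and the actual checkpoint names and loading path depend on the official FAMA release.

```python
# Minimal sketch: transcribing an audio file with an openly released speech
# foundation model via the Hugging Face `transformers` ASR pipeline.
# NOTE: the model ID below is a hypothetical placeholder; replace it with the
# checkpoint name published in the official FAMA release.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="FBK-MT/fama-small",  # hypothetical ID (English/Italian ASR)
)

result = asr("speech_sample.wav")  # any audio file readable by ffmpeg
print(result["text"])
```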
