FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
May 28, 2025
Authors: Sara Papi, Marco Gaido, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
cs.AI
Abstract
The development of speech foundation models (SFMs) like Whisper and
SeamlessM4T has significantly advanced the field of speech processing. However,
their closed nature--with inaccessible training data and code--poses major
reproducibility and fair evaluation challenges. While other domains have made
substantial progress toward open science by developing fully transparent models
trained on open-source (OS) code and data, similar efforts in speech remain
limited. To fill this gap, we introduce FAMA, the first family of open science
SFMs for English and Italian, trained on 150k+ hours of OS speech data.
Moreover, we present a new dataset containing 16k hours of cleaned and
pseudo-labeled speech for both languages. Results show that FAMA achieves
competitive performance compared to existing SFMs while being up to 8 times
faster. All artifacts, including code, datasets, and models, are released under
OS-compliant licenses, promoting openness in speech technology research.