FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

Abstract

The development of speech foundation models (SFMs) such as Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major challenges for reproducibility and fair evaluation. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open-science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
