Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe

Abstract

We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study uses over 2 million training utterances from 30 publicly available speech corpora that provide dialectal information. We evaluate the dialect classification performance of several widely used speech foundation models, assess the robustness of the dialect models under noisy conditions, and present an error analysis showing that the models' confusions are consistent with geographic continuity. Beyond benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. First, we show that Voxlect can augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Second, we use Voxlect as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under a RAIL-family license at https://github.com/tiantiaf0627/voxlect.