Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
August 3, 2025
作者: Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
cs.AI
Abstract
We present Voxlect, a novel benchmark for modeling dialects and regional
languages worldwide using speech foundation models. Specifically, we report
comprehensive benchmark evaluations on dialects and regional language varieties
in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai,
Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over
2 million training utterances from 30 publicly available speech corpora
annotated with dialectal information. We evaluate the performance of several
widely used speech foundation models in classifying speech dialects. We assess
the robustness of the dialectal models under noisy conditions and present an
error analysis that highlights modeling results aligned with geographic
continuity. In addition to benchmarking dialect classification, we demonstrate
several downstream applications enabled by Voxlect. Specifically, we show that
Voxlect can be applied to augment existing speech recognition datasets with
dialect information, enabling a more detailed analysis of ASR performance
across dialectal variations. Voxlect is also used as a tool to evaluate the
performance of speech generation systems. Voxlect is publicly available under
the RAIL family of licenses at: https://github.com/tiantiaf0627/voxlect.
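
As a rough illustration of the dialect-classification setup benchmarked above, the sketch below freezes a pretrained speech foundation model and trains a lightweight probe on its utterance-level embeddings. This is a minimal sketch, not Voxlect's actual API: the WavLM backbone, the mean-pooling head, and the example file paths and dialect labels are illustrative assumptions.

```python
# Minimal sketch (not Voxlect's actual pipeline): frozen speech foundation model
# embeddings + a linear probe for dialect classification.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_ID = "microsoft/wavlm-base-plus"  # assumed backbone; any HF speech encoder works
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
backbone = AutoModel.from_pretrained(MODEL_ID).eval()

def embed(wav_path: str) -> torch.Tensor:
    """Return an utterance-level embedding by mean-pooling frame features."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = backbone(**inputs).last_hidden_state  # (1, T, D)
    return frames.mean(dim=1).squeeze(0)               # (D,)

# Hypothetical labeled utterances; in practice these come from the dialect corpora.
train_files = ["us_english.wav", "indian_english.wav", "scottish_english.wav"]
train_labels = ["en-US", "en-IN", "en-Scotland"]

X = torch.stack([embed(f) for f in train_files]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Predict the dialect of an unseen utterance.
print(clf.predict(torch.stack([embed("unknown_speaker.wav")]).numpy()))
```

The same predicted labels could, in principle, be attached to utterances in an existing ASR corpus to support the kind of per-dialect error analysis described in the abstract.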