Using Songs to Improve Kazakh Automatic Speech Recognition
March 1, 2026
Author: Rustem Yeshpanov
cs.AI
Abstract
Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the zero-shot model. Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
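The abstract reports results as normalised word error rate (WER), i.e. WER computed after applying a text normaliser to both reference and hypothesis. A minimal, self-contained sketch of that metric is below; the `normalise` function here is a simplified stand-in (lowercasing plus punctuation stripping), since the paper's exact normalisation pipeline is not specified in the abstract.

```python
import re


def normalise(text: str) -> str:
    # Simplified normalisation: lowercase, strip punctuation,
    # collapse whitespace. A stand-in for Whisper-style normalisers.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = normalise(reference).split()
    hyp = normalise(hypothesis).split()
    # Dynamic-programming (Levenshtein) distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution + one deletion against a 3-word reference -> 2/3.
print(round(wer("Сәлем әлем, қалайсың?", "сәлем элем"), 3))
```

In practice, evaluation scripts typically use a library such as `jiwer` for this computation; the hand-rolled version above only illustrates what the reported percentages measure.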