

Using Songs to Improve Kazakh Automatic Speech Recognition

March 1, 2026
Author: Rustem Yeshpanov
cs.AI

Abstract

Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the zero-shot model. Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
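The normalised WER figures quoted above follow the standard word error rate definition: the word-level edit distance between a normalised reference transcript and the hypothesis, divided by the reference length. A minimal pure-Python sketch of this computation (the normalisation step here — lowercasing and punctuation stripping — is a simplified assumption, not the paper's exact pipeline):

```python
import string


def normalise(text: str) -> str:
    """Simplified transcript normalisation (an assumption here):
    lowercase and strip punctuation before scoring."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    ref = normalise("The quick brown fox.")
    hyp = normalise("the quick, brown box")
    print(f"Normalised WER: {wer(ref, hyp):.1%}")  # one substitution in four words
```

A reported score such as 27.6% normalised WER on CVC corresponds to roughly one word error for every 3.6 reference words after normalisation.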
PDF (March 4, 2026)