Using Songs to Improve Kazakh Automatic Speech Recognition

Abstract

Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, and roughly halves the error on KSC2 relative to the zero-shot model (39.3% vs. 81.2%). Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
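The abstract reports normalised WER without spelling out the normalisation. As a rough illustration only, the sketch below shows one way such a figure could be computed and how the KSC2 comparison (39.3% vs. 81.2%) translates into a relative error reduction. The normalisation scheme (NFKC, lowercasing, punctuation removal) and all function names are assumptions made for this example, not the authors' evaluation code.

```python
import re
import unicodedata


def normalise(text: str) -> list[str]:
    """Assumed normalisation: NFKC, lowercase, drop punctuation, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # \w is Unicode-aware for str patterns, so Kazakh Cyrillic letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalised_wer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalise(ref), normalise(hyp)
        total_errors += word_errors(r, h)
        total_words += len(r)
    return total_errors / total_words


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction of the baseline error removed by the adapted model."""
    return (baseline - adapted) / baseline


if __name__ == "__main__":
    # KSC2 figures quoted in the abstract: zero-shot 81.2%, fine-tuned 39.3%.
    print(f"{relative_reduction(81.2, 39.3):.1%}")
```

With the figures from the abstract, the relative reduction comes out slightly above one half, which matches the "halving" description. The sketch assumes at least one non-empty reference; an empty reference set would cause a division by zero.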
