What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
June 1, 2025
Authors: Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
cs.AI
Abstract
How language-specific are the speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on a specific language improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in the internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit to linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
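
The probing setup the abstract describes can be illustrated with a short sketch: extract frame-level hidden states from each Wav2Vec2 layer and fit a linear classifier (e.g., for phone labels) on top of them. This is a minimal illustration, not the authors' code: the checkpoint name is a placeholder (a Dutch-pre-trained model would be substituted), and frame-aligned phone labels are assumed to be available.

```python
# Minimal layer-wise probing sketch for Wav2Vec2 hidden states.
# Assumptions: a HuggingFace checkpoint (placeholder id below) and
# frame-level (feature_frame, phone_label) pairs from a forced alignment.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

MODEL_ID = "facebook/wav2vec2-base"  # placeholder; swap in the checkpoint under study

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID, output_hidden_states=True)
model.eval()

def layer_features(waveform, sr=16_000):
    """Return per-layer hidden states for one utterance: a list of (T, D) arrays."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states is a tuple of (1, T, D) tensors, one per layer
    return [h.squeeze(0).numpy() for h in out.hidden_states]

def probe_layer(feats_train, y_train, feats_test, y_test):
    """Fit a linear probe on frame-level features; return held-out accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(feats_train, y_train)
    return clf.score(feats_test, y_test)
```

Running `probe_layer` separately on each element of `layer_features` traces where in the network phonetic information is most linearly decodable; the zero-shot metrics mentioned in the abstract would instead compare representations directly, without fitting any classifier.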