What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
June 1, 2025
Authors: Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
cs.AI
Abstract
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in the internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well detected by trained clustering or classification probes, and partially observable with zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
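
To make the probing setup concrete, below is a minimal sketch of how a trained classification probe can read out frame-level phonetic information from the internal layers of a Wav2Vec2 model, assuming the Hugging Face transformers and scikit-learn libraries. The checkpoint name, the dummy waveform, and the random phone labels are illustrative placeholders, not the authors' actual models or data.

```python
# Sketch: fit a classification probe on one internal layer of Wav2Vec2.
# Placeholder checkpoint; a Dutch-pretrained model would be swapped in here.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base"  # illustrative, not the paper's model

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

# Dummy 2-second waveform at 16 kHz, standing in for a real Dutch recording.
waveform = np.random.randn(16000 * 2).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape
# (batch, frames, hidden_dim); probe one intermediate transformer layer.
layer = 6
frames = outputs.hidden_states[layer].squeeze(0).numpy()

# Hypothetical frame-aligned phone labels; a real probe would use forced
# alignments of an annotated Dutch corpus.
labels = np.random.randint(0, 40, size=frames.shape[0])

X_train, X_test, y_train, y_test = train_test_split(
    frames, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Layer {layer} phone-probe accuracy: {probe.score(X_test, y_test):.3f}")
```

In a setting like the paper's, such a probe would be fitted on aligned Dutch speech, and the same readout compared across Dutch-only, English-only, and multilingual pre-trained checkpoints to quantify the language-specific advantage layer by layer.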