What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Abstract

How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on a specific language improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in the internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well detected by trained clustering or classification probes, and is partially observable using zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
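To make the notion of a trained classification probe concrete, the following is a minimal sketch and not the probing setup used in this work. It assumes synthetic Gaussian vectors as a stand-in for frame-level hidden states, three hypothetical phone classes, and a nearest-centroid classifier as the simplest possible probe; the names `make_features`, `fit_centroids`, and `probe_accuracy` are illustrative only. The idea it shows is that a probe is fitted on frozen representations and its held-out accuracy is read as a measure of how accessibly a feature is encoded.

```python
import random


def make_features(n_per_class, dim, separation, seed):
    """Synthetic stand-in for frozen hidden states (assumption: one Gaussian cluster per class)."""
    rng = random.Random(seed)
    data = []
    for label in range(3):  # three hypothetical phone classes
        # Each class centre is shifted by `separation` along its own axis.
        centre = [separation if d == label else 0.0 for d in range(dim)]
        for _ in range(n_per_class):
            vec = [c + rng.gauss(0.0, 1.0) for c in centre]
            data.append((vec, label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Train the probe: compute the mean vector of each class."""
    sums, counts = {}, {}
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vec):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], vec))


def probe_accuracy(data, train_fraction=0.8):
    """Fit the probe on a training split and report accuracy on the held-out split."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


# Two hypothetical "models": one whose representations separate the classes
# weakly, and one whose representations separate them strongly.
weak = make_features(n_per_class=200, dim=8, separation=0.5, seed=0)
strong = make_features(n_per_class=200, dim=8, separation=4.0, seed=0)

acc_weak = probe_accuracy(weak)
acc_strong = probe_accuracy(strong)
print(f"probe accuracy, weakly separated features:   {acc_weak:.2f}")
print(f"probe accuracy, strongly separated features: {acc_strong:.2f}")
```

In an actual comparison, the synthetic vectors would be replaced by hidden states extracted from each pre-trained model on the same Dutch evaluation data, and the same probe would be fitted to each model so that differences in held-out accuracy reflect the representations and not the probe.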

