DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Abstract

In this paper, we introduce DinoSR (self-distillation and online clustering for self-supervised speech representation learning), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and yield a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
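
The three steps described in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify how the codebook or the teacher is updated, so the exponential-moving-average updates below are assumptions, and all function names are hypothetical. The "teacher embeddings" are hand-written 2-D vectors rather than outputs of a real network.

```python
def nearest_code(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def online_cluster_step(codebook, embeddings, decay=0.9):
    """Assign each embedding to its nearest codeword, then move every used
    codeword toward the mean of its assigned embeddings (assumed EMA update).
    Mutates codebook in place and returns the discrete token ids."""
    tokens = [nearest_code(codebook, e) for e in embeddings]
    for k in set(tokens):
        members = [e for e, t in zip(embeddings, tokens) if t == k]
        dim = len(codebook[k])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        codebook[k] = [decay * c + (1 - decay) * m
                       for c, m in zip(codebook[k], mean)]
    return tokens


def masked_targets(tokens, mask):
    """Keep token ids only at masked positions; a student network would be
    trained to predict these ids from the masked input."""
    return {i: t for i, (t, m) in enumerate(zip(tokens, mask)) if m}


def ema_update(teacher, student, decay=0.99):
    """Assumed self-distillation rule: teacher weights track an
    exponential moving average of the student weights."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]


# Toy run: four "frames" of teacher embeddings and a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
teacher_embeddings = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.0], [1.1, 0.8]]
tokens = online_cluster_step(codebook, teacher_embeddings)  # [0, 1, 0, 1]
targets = masked_targets(tokens, mask=[False, True, True, False])
print(tokens, targets)
```

In this toy setting the token ids play the role of the machine-discovered units: the student never sees the teacher embeddings directly, only the discrete ids at the masked positions.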
