DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
May 17, 2023
Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
cs.AI
Abstract
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR) which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses the previous state of the art on several downstream tasks, and we
provide a detailed analysis of the model and the learned discrete units.
The source code will be made available after the anonymity period.
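The training loop sketched in the abstract (a teacher network produces embeddings, an online clustering step discretizes them, and the resulting tokens supervise a student whose weights the teacher tracks) can be illustrated with a minimal toy example. This is a hedged sketch only: all names (`ema_update`, `online_cluster`), dimensions, and rates are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Toy sketch of the DinoSR ideas: EMA teacher, online codebook clustering,
# discrete targets for the student. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)

DIM, CODES, TAU = 16, 8, 0.999  # embedding dim, codebook size, EMA decay (assumed)

def ema_update(teacher_w, student_w, tau=TAU):
    """Teacher weights track the student via an exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

def online_cluster(codebook, embeddings, lr=0.1):
    """Assign each teacher embedding to its nearest code, then nudge the
    selected codes toward their members (an online k-means style update)."""
    dists = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
    assignments = dists.argmin(axis=1)          # discrete token per frame
    new_codebook = codebook.copy()
    for k in np.unique(assignments):
        members = embeddings[assignments == k]
        new_codebook[k] += lr * (members.mean(axis=0) - codebook[k])
    return assignments, new_codebook

# Toy demo: teacher embeddings for 32 "frames", clustered online.
codebook = rng.normal(size=(CODES, DIM))
teacher_embeddings = rng.normal(size=(32, DIM))
targets, codebook = online_cluster(codebook, teacher_embeddings)

# In the full method, `targets` would supervise the student's predictions at
# masked positions (e.g. via cross-entropy), and the teacher would then be
# refreshed from the student with `ema_update`.
print(targets.shape, codebook.shape)
```

The key design point the abstract highlights is that the three pieces are complementary: masking gives the student a prediction task, the EMA teacher stabilizes the targets, and online clustering turns continuous embeddings into a machine-discovered discrete inventory.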