DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
May 17, 2023
Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
cs.AI
Abstract
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR) which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses the previous state of the art on several downstream tasks, and we
provide a detailed analysis of the model and the learned discrete units.
The source code will be made available after the anonymity period.
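The training loop sketched in the abstract (a teacher network produces embeddings, an online clustering step discretizes them, and the resulting tokens supervise a student whose weights the teacher tracks) can be illustrated with a minimal toy example. This is a hedged sketch only: all names (`ema_update`, `online_cluster`), dimensions, and rates are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Toy sketch of the DinoSR ideas: EMA teacher, online codebook clustering,
# discrete targets for the student. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)

DIM, CODES, TAU = 16, 8, 0.999  # embedding dim, codebook size, EMA decay (assumed)

def ema_update(teacher_w, student_w, tau=TAU):
    """Teacher weights track the student via an exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

def online_cluster(codebook, embeddings, lr=0.1):
    """Assign each teacher embedding to its nearest code, then nudge the
    selected codes toward their members (an online k-means style update)."""
    dists = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
    assignments = dists.argmin(axis=1)          # discrete token per frame
    new_codebook = codebook.copy()
    for k in np.unique(assignments):
        members = embeddings[assignments == k]
        new_codebook[k] += lr * (members.mean(axis=0) - codebook[k])
    return assignments, new_codebook

# Toy demo: teacher embeddings for 32 "frames", clustered online.
codebook = rng.normal(size=(CODES, DIM))
teacher_embeddings = rng.normal(size=(32, DIM))
targets, codebook = online_cluster(codebook, teacher_embeddings)

# In the full method, `targets` would supervise the student's predictions at
# masked positions (e.g. via cross-entropy), and the teacher would then be
# refreshed from the student with `ema_update`.
print(targets.shape, codebook.shape)
```

The key design point the abstract highlights is that the three pieces are complementary: masking gives the student a prediction task, the EMA teacher stabilizes the targets, and online clustering turns continuous embeddings into a machine-discovered discrete inventory.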