DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
May 17, 2023
Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
cs.AI
Abstract
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR) which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses previous state-of-the-art performance in several downstream tasks,
and provide a detailed analysis of the model and the learned discrete units.
The source code will be made available after the anonymity period.
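The training loop described in the abstract — a teacher network producing contextualized embeddings, an online clustering step that discretizes them into a learned phone inventory, a student trained against those discrete targets, and a teacher updated by self-distillation — can be sketched very loosely as follows. This is a toy illustration, not the authors' implementation: the linear "encoders," the dimensions, and the decay constants are all hypothetical stand-ins, and the student's gradient step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding dim, codebook size, frames per utterance.
D, K, T = 16, 8, 50

# Stand-ins for the real transformer encoders (here: plain linear maps).
student_W = rng.normal(size=(D, D))
teacher_W = student_W.copy()            # teacher starts as a copy of the student
codebook = rng.normal(size=(K, D))      # online-clustered "phone inventory"
ema_decay, cluster_decay = 0.999, 0.9   # illustrative decay rates

def encode(W, x):
    """Placeholder for a contextualizing encoder forward pass."""
    return x @ W

for step in range(100):
    frames = rng.normal(size=(T, D))    # fake audio features for one utterance

    # 1) Teacher extracts contextualized embeddings from the input.
    t_emb = encode(teacher_W, frames)

    # 2) Online clustering: assign each frame to its nearest codeword,
    #    then move each used codeword toward the mean of its members (EMA).
    dists = ((t_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)       # discrete targets for the student
    for k in range(K):
        members = t_emb[assign == k]
        if len(members):
            codebook[k] = (cluster_decay * codebook[k]
                           + (1 - cluster_decay) * members.mean(axis=0))

    # 3) The student would be trained here to predict `assign` on masked
    #    frames (cross-entropy over K codes); the gradient step is omitted.

    # 4) Self-distillation: teacher tracks the student by exponential
    #    moving average of its weights.
    teacher_W = ema_decay * teacher_W + (1 - ema_decay) * student_W
```

The key design point the sketch tries to convey is that the discrete targets come from clustering the teacher's own embeddings online, so no external quantizer or offline k-means pass is needed.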