DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
May 17, 2023
Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
cs.AI
Abstract
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR) which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses previous state-of-the-art performance in several downstream tasks,
and provide a detailed analysis of the model and the learned discrete units.
The source code will be made available after the anonymity period.
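The training loop described in the abstract — a teacher network producing contextualized embeddings, an online clustering step that discretizes them into a learned phone inventory, a student trained against those discrete targets, and a teacher updated by self-distillation — can be sketched very loosely as follows. This is a toy illustration, not the authors' implementation: the linear "encoders," the dimensions, and the decay constants are all hypothetical stand-ins, and the student's gradient step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding dim, codebook size, frames per utterance.
D, K, T = 16, 8, 50

# Stand-ins for the real transformer encoders (here: plain linear maps).
student_W = rng.normal(size=(D, D))
teacher_W = student_W.copy()            # teacher starts as a copy of the student
codebook = rng.normal(size=(K, D))      # online-clustered "phone inventory"
ema_decay, cluster_decay = 0.999, 0.9   # illustrative decay rates

def encode(W, x):
    """Placeholder for a contextualizing encoder forward pass."""
    return x @ W

for step in range(100):
    frames = rng.normal(size=(T, D))    # fake audio features for one utterance

    # 1) Teacher extracts contextualized embeddings from the input.
    t_emb = encode(teacher_W, frames)

    # 2) Online clustering: assign each frame to its nearest codeword,
    #    then move each used codeword toward the mean of its members (EMA).
    dists = ((t_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)       # discrete targets for the student
    for k in range(K):
        members = t_emb[assign == k]
        if len(members):
            codebook[k] = (cluster_decay * codebook[k]
                           + (1 - cluster_decay) * members.mean(axis=0))

    # 3) The student would be trained here to predict `assign` on masked
    #    frames (cross-entropy over K codes); the gradient step is omitted.

    # 4) Self-distillation: teacher tracks the student by exponential
    #    moving average of its weights.
    teacher_W = ema_decay * teacher_W + (1 - ema_decay) * student_W
```

The key design point the sketch tries to convey is that the discrete targets come from clustering the teacher's own embeddings online, so no external quantizer or offline k-means pass is needed.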