Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
November 1, 2023
Authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
cs.AI
Abstract
As the size of pre-trained speech recognition models increases, running these
large models in low-latency or resource-constrained environments becomes
challenging. In this work, we leverage pseudo-labelling to assemble a
large-scale open-source dataset which we use to distill the Whisper model into
a smaller variant, called Distil-Whisper. Using a simple word error rate (WER)
heuristic, we select only the highest quality pseudo-labels for training. The
distilled model is 5.8 times faster with 51% fewer parameters, while performing
to within 1% WER on out-of-distribution test data in a zero-shot transfer
setting. Distil-Whisper maintains the robustness of the Whisper model to
difficult acoustic conditions, while being less prone to hallucination errors
on long-form audio. Distil-Whisper is designed to be paired with Whisper for
speculative decoding, yielding a 2 times speed-up while mathematically ensuring
the same outputs as the original model. To facilitate further research in this
domain, we make our training code, inference code and models publicly
accessible.
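As a rough illustration of the WER heuristic mentioned in the abstract, the sketch below keeps a pseudo-labelled training example only when the Whisper-generated transcript stays within a WER threshold of the dataset's reference transcript. The lower-casing normalisation and the 10% threshold are assumptions for the example, not the paper's exact settings.

```python
# Minimal sketch of a WER-threshold filter for pseudo-labelled training data.
# The normalisation and the 10% threshold are illustrative assumptions; the
# paper's exact normaliser and threshold may differ.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)


def keep_pseudo_label(reference: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    """Keep a training example only if the pseudo-label is close to the reference."""
    return word_error_rate(reference, pseudo_label) <= max_wer
```

Filtering in this way discards samples where the teacher's transcript disagrees strongly with the reference, which is where pseudo-labels are most likely to be wrong.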
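The distillation objective itself is not spelled out in the abstract. As a hedged sketch, a standard sequence-to-sequence knowledge-distillation loss combines cross-entropy on the filtered pseudo-labels with a KL term pulling the student's per-token distribution toward the teacher's; the function below is illustrative only, and the weighting coefficients are assumptions rather than the paper's reported values.

```python
# Illustrative distillation objective: pseudo-label cross-entropy plus a
# teacher-student KL term. Weights are placeholder assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      pseudo_label_ids: torch.Tensor,
                      ce_weight: float = 1.0,
                      kl_weight: float = 1.0) -> torch.Tensor:
    """student_logits / teacher_logits: (batch, seq_len, vocab);
    pseudo_label_ids: (batch, seq_len) token ids produced by the teacher."""
    vocab = student_logits.size(-1)
    # Cross-entropy against the WER-filtered pseudo-labels.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         pseudo_label_ids.reshape(-1))
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(teacher_logits, dim=-1),
                  log_target=True,
                  reduction="batchmean")
    return ce_weight * ce + kl_weight * kl
```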
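For the speculative-decoding pairing described above, a minimal sketch using the Hugging Face transformers assisted-generation API is shown below: Distil-Whisper drafts candidate tokens and the full Whisper model verifies them, so the final transcript matches what Whisper alone would produce. The checkpoint names (`openai/whisper-large-v2`, `distil-whisper/distil-large-v2`) and the dummy LibriSpeech sample are assumptions for the example, not a definitive recipe.

```python
# Sketch of speculative decoding: Distil-Whisper drafts tokens, Whisper verifies
# them, so the output is guaranteed to match Whisper's own transcription.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# One short validation sample, purely for illustration.
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]
inputs = processor(sample["audio"]["array"],
                   sampling_rate=sample["audio"]["sampling_rate"],
                   return_tensors="pt").to(device)

# The assistant proposes several tokens per step; the teacher accepts or rejects them.
generated_ids = teacher.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because the teacher verifies every drafted token, this setup trades extra memory for lower latency without changing the output distribution of the original model.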