Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
November 1, 2023
Authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
cs.AI
Abstract
As the size of pre-trained speech recognition models increases, running these
large models in low-latency or resource-constrained environments becomes
challenging. In this work, we leverage pseudo-labelling to assemble a
large-scale open-source dataset which we use to distill the Whisper model into
a smaller variant, called Distil-Whisper. Using a simple word error rate (WER)
heuristic, we select only the highest quality pseudo-labels for training. The
distilled model is 5.8 times faster with 51% fewer parameters, while performing
to within 1% WER on out-of-distribution test data in a zero-shot transfer
setting. Distil-Whisper maintains the robustness of the Whisper model to
difficult acoustic conditions, while being less prone to hallucination errors
on long-form audio. Distil-Whisper is designed to be paired with Whisper for
speculative decoding, yielding a 2 times speed-up while mathematically ensuring
the same outputs as the original model. To facilitate further research in this
domain, we make our training code, inference code and models publicly
accessible.
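
Two of the ideas in the abstract lend themselves to short sketches. First, the WER heuristic for pseudo-label selection: a Whisper-generated transcript is kept for training only if its word error rate against the dataset's ground-truth transcript falls below a threshold. The helper below is a minimal sketch of that idea, not the paper's exact implementation; the `jiwer` dependency, the lack of text normalisation, and the 10% threshold are illustrative assumptions.

```python
import jiwer


def keep_pseudo_label(ground_truth: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    # Compare the Whisper-generated pseudo-label against the dataset's
    # ground-truth transcript and keep it only if the WER is low enough.
    # The 10% threshold is an illustrative choice, not the paper's value.
    return jiwer.wer(ground_truth, pseudo_label) <= max_wer
```

Second, the pairing with Whisper for speculative decoding can be sketched with the Hugging Face transformers assisted-generation API: the distilled model drafts tokens and the full Whisper model verifies them, so the final transcript matches what Whisper alone would produce. The checkpoint names and the sample dataset below are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

teacher_id = "openai/whisper-large-v2"           # full-size Whisper (verifier)
assistant_id = "distil-whisper/distil-large-v2"  # distilled draft model (assumed name)

processor = AutoProcessor.from_pretrained(teacher_id)
teacher = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_id)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(assistant_id)

# Load a short validation sample (illustrative dataset choice).
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]
inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")

# The assistant drafts tokens; the teacher verifies them, so the output
# matches what the teacher alone would have generated.
predicted_ids = teacher.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```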