Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

November 1, 2023
Authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
cs.AI

Abstract

As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
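The WER heuristic mentioned above amounts to comparing each Whisper pseudo-label against the original ground-truth transcription and discarding noisy pairs. Below is a minimal sketch of such a filter, assuming the `jiwer` package for WER computation; the threshold value, the `filter_pseudo_labels` helper, and the record format are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a WER-based pseudo-label filter (illustrative only).
import jiwer

WER_THRESHOLD = 0.10  # hypothetical cut-off; the paper selects its own value


def filter_pseudo_labels(examples):
    """Keep only examples whose Whisper pseudo-label closely matches the
    original ground-truth transcription, as measured by word error rate."""
    kept = []
    for example in examples:
        wer = jiwer.wer(example["ground_truth"], example["pseudo_label"])
        if wer <= WER_THRESHOLD:
            kept.append(example)
    return kept
```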
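For speculative decoding, the distilled model acts as a draft (assistant) model whose proposed tokens are verified by the full Whisper model, so the final transcription matches standard Whisper decoding. The sketch below shows one way to wire this up with Hugging Face Transformers' assisted generation; the checkpoint names and the dummy audio dataset are assumptions chosen for illustration.

```python
# Hedged sketch: Distil-Whisper as the draft model for speculative decoding,
# with the original Whisper model verifying its proposed tokens.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# Load a short 16 kHz sample purely for demonstration.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).to(device)

# The assistant proposes candidate tokens; the teacher accepts or rejects them,
# so the output is identical to decoding with Whisper alone, only faster.
generated_ids = teacher.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```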