Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
November 1, 2023
Authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
cs.AI
Abstract
As the size of pre-trained speech recognition models increases, running these
large models in low-latency or resource-constrained environments becomes
challenging. In this work, we leverage pseudo-labelling to assemble a
large-scale open-source dataset which we use to distill the Whisper model into
a smaller variant, called Distil-Whisper. Using a simple word error rate (WER)
heuristic, we select only the highest quality pseudo-labels for training. The
distilled model is 5.8 times faster with 51% fewer parameters, while performing
to within 1% WER on out-of-distribution test data in a zero-shot transfer
setting. Distil-Whisper maintains the robustness of the Whisper model to
difficult acoustic conditions, while being less prone to hallucination errors
on long-form audio. Distil-Whisper is designed to be paired with Whisper for
speculative decoding, yielding a 2 times speed-up while mathematically ensuring
the same outputs as the original model. To facilitate further research in this
domain, we make our training code, inference code and models publicly
accessible.
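
Two of the ideas in the abstract lend themselves to short sketches. First, the WER heuristic for pseudo-label selection: a Whisper-generated transcript is kept for training only if its word error rate against the dataset's ground-truth transcript falls below a threshold. The helper below is a minimal sketch of that idea, not the paper's exact implementation; the `jiwer` dependency, the lack of text normalisation, and the 10% threshold are illustrative assumptions.

```python
import jiwer


def keep_pseudo_label(ground_truth: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    # Compare the Whisper-generated pseudo-label against the dataset's
    # ground-truth transcript and keep it only if the WER is low enough.
    # The 10% threshold is an illustrative choice, not the paper's value.
    return jiwer.wer(ground_truth, pseudo_label) <= max_wer
```

Second, the pairing with Whisper for speculative decoding can be sketched with the Hugging Face transformers assisted-generation API: the distilled model drafts tokens and the full Whisper model verifies them, so the final transcript matches what Whisper alone would produce. The checkpoint names and the sample dataset below are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

teacher_id = "openai/whisper-large-v2"           # full-size Whisper (verifier)
assistant_id = "distil-whisper/distil-large-v2"  # distilled draft model (assumed name)

processor = AutoProcessor.from_pretrained(teacher_id)
teacher = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_id)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(assistant_id)

# Load a short validation sample (illustrative dataset choice).
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]
inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")

# The assistant drafts tokens; the teacher verifies them, so the output
# matches what the teacher alone would have generated.
predicted_ids = teacher.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```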