Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

Abstract

As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
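The abstract does not spell out the WER heuristic, so the following is only a minimal sketch of one plausible reading: compare each pseudo-label against a reference transcript for the same audio and keep the example only if the WER falls at or below a cutoff. The function names, the dictionary keys (`reference`, `pseudo_label`), the lowercase-and-split normalisation, and the `max_wer=0.1` default are all assumptions made for illustration, not details taken from the text above.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalisation: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # WER is undefined for an empty reference; treat any output as an error.
        return float("inf") if hyp else 0.0

    # Standard Levenshtein recurrence over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(
    examples: List[Dict[str, str]], max_wer: float = 0.1
) -> List[Dict[str, str]]:
    """Keep examples whose pseudo-label WER is at most max_wer (hypothetical cutoff)."""
    return [
        ex
        for ex in examples
        if word_error_rate(ex["reference"], ex["pseudo_label"]) <= max_wer
    ]


examples = [
    {"reference": "the cat sat on the mat", "pseudo_label": "the cat sat on the mat"},
    {"reference": "the cat sat on the mat", "pseudo_label": "a cat is on a mat"},
]
kept = filter_pseudo_labels(examples)
```

With the assumed cutoff, the first example is kept because its pseudo-label matches the reference exactly, while the second is discarded. A real pipeline would likely apply a stronger text normaliser before scoring, since differences in casing and punctuation otherwise inflate the WER.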
