Efficient Audio Captioning with Encoder-Level Knowledge Distillation
July 19, 2024
Authors: Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley
cs.AI
Abstract
Significant improvement has been achieved in automated audio captioning (AAC)
with recent models. However, these models have become increasingly large as
their performance is enhanced. In this work, we propose a knowledge
distillation (KD) framework for AAC. Our analysis shows that in the
encoder-decoder based AAC models, it is more effective to distill knowledge
into the encoder as compared with the decoder. To this end, we incorporate
encoder-level KD loss into training, in addition to the standard supervised
loss and sequence-level KD loss. We investigate two encoder-level KD methods,
based on mean squared error (MSE) loss and contrastive loss, respectively.
Experimental results demonstrate that contrastive KD is more robust than MSE
KD, exhibiting superior performance in data-scarce situations. By leveraging
audio-only data for training in the KD framework, our student model achieves
competitive performance, with an inference speed that is 19 times
faster. An online demo is available at
\url{https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning}.
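
To make the loss formulation concrete, here is a minimal PyTorch sketch (not the authors' released code) of the two encoder-level KD variants described in the abstract: an MSE loss that matches student encoder embeddings to teacher embeddings, and a contrastive (InfoNCE-style) loss that pulls matched student/teacher embeddings together within a batch. The projection layer, embedding dimensions, temperature, and loss weights are illustrative assumptions.

```python
# Minimal sketch of encoder-level KD losses for an encoder-decoder AAC model.
# Assumes one pooled embedding per audio clip from both student and teacher
# encoders; all names and sizes below are hypothetical.
import torch
import torch.nn.functional as F


def mse_kd_loss(student_emb, teacher_emb, proj):
    """MSE-based encoder-level KD: regress projected student embeddings
    onto (detached) teacher embeddings."""
    return F.mse_loss(proj(student_emb), teacher_emb.detach())


def contrastive_kd_loss(student_emb, teacher_emb, proj, temperature=0.07):
    """Contrastive encoder-level KD: the student embedding of a clip should be
    closer to the teacher embedding of the same clip than to teacher
    embeddings of other clips in the batch."""
    s = F.normalize(proj(student_emb), dim=-1)           # (B, D)
    t = F.normalize(teacher_emb.detach(), dim=-1)        # (B, D)
    logits = s @ t.t() / temperature                     # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)   # positives on diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, d_student, d_teacher = 4, 256, 768                # hypothetical sizes
    proj = torch.nn.Linear(d_student, d_teacher)         # bridge dimension gap
    student = torch.randn(B, d_student)
    teacher = torch.randn(B, d_teacher)
    print("MSE KD loss:", mse_kd_loss(student, teacher, proj).item())
    print("Contrastive KD loss:", contrastive_kd_loss(student, teacher, proj).item())
```

In the full training objective described above, either encoder-level KD term would be added, with a hypothetical weight, to the standard supervised captioning loss and the sequence-level KD loss.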