Efficient Audio Captioning with Encoder-Level Knowledge Distillation
July 19, 2024
Authors: Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley
cs.AI
Abstract
Significant improvement has been achieved in automated audio captioning (AAC)
with recent models. However, these models have become increasingly large as
their performance is enhanced. In this work, we propose a knowledge
distillation (KD) framework for AAC. Our analysis shows that in the
encoder-decoder based AAC models, it is more effective to distill knowledge
into the encoder as compared with the decoder. To this end, we incorporate
encoder-level KD loss into training, in addition to the standard supervised
loss and sequence-level KD loss. We investigate two encoder-level KD methods,
based on mean squared error (MSE) loss and contrastive loss, respectively.
Experimental results demonstrate that contrastive KD is more robust than MSE
KD, exhibiting superior performance in data-scarce situations. By leveraging
audio-only data for training in the KD framework, our student model achieves
competitive performance, with an inference speed that is 19 times faster.
An online demo is available at
\url{https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning}.
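The abstract contrasts two encoder-level KD objectives: an MSE loss that pulls student encoder embeddings toward the teacher's, and a contrastive loss that matches each student embedding to the teacher embedding of the same clip against other clips in the batch. A minimal NumPy sketch of the two losses, assuming batch-aligned teacher and student encoder embeddings (function names, the InfoNCE form of the contrastive loss, and the temperature value are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def mse_kd_loss(student_emb, teacher_emb):
    # Mean squared error between student and teacher encoder embeddings.
    return np.mean((student_emb - teacher_emb) ** 2)

def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.1):
    # InfoNCE-style contrastive loss (an assumed formulation): each student
    # embedding should match the teacher embedding of the same audio clip
    # (the positive pair) against the other clips in the batch (negatives).
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positive pairs

# Toy embeddings standing in for encoder outputs (shapes are hypothetical).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))  # roughly aligned student
print(mse_kd_loss(student, teacher))
print(contrastive_kd_loss(student, teacher))
```

In a real training loop these terms would be added, with weighting coefficients, to the standard supervised captioning loss and the sequence-level KD loss described in the abstract; the contrastive variant only needs embeddings to be discriminable per clip rather than reproduced exactly, which is one plausible reading of its reported robustness in data-scarce settings.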