Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Abstract

Recent models have brought significant improvements to automated audio captioning (AAC). However, these models have grown increasingly large as their performance has improved. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder than into the decoder. To this end, we incorporate an encoder-level KD loss into training, in addition to the standard supervised loss and the sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce situations. By incorporating audio-only data into training within the KD framework, our student model achieves competitive performance, with an inference speed that is 19 times faster. An online demo is available at \url{https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning}.
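As an illustration of the two encoder-level KD objectives named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that student and teacher encoder outputs have already been pooled and projected to clip-level embeddings of equal dimension, that the contrastive variant is a one-directional InfoNCE-style loss over the batch, and that the three training losses are combined by a weighted sum. The function names, the temperature value, and the weights `w_seq` and `w_enc` are hypothetical.

```python
import math


def mse_kd_loss(student, teacher):
    """Mean squared error between paired student and teacher embeddings."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_kd_loss(student, teacher, temperature=0.07):
    """InfoNCE-style loss (assumed form): the i-th student embedding should
    be closer to the i-th teacher embedding than to the other teacher
    embeddings in the batch."""
    s_norm = [_normalize(v) for v in student]
    t_norm = [_normalize(v) for v in teacher]
    loss = 0.0
    for i, s in enumerate(s_norm):
        # Cosine similarity of student i to every teacher embedding.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in t_norm]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching teacher as the target class.
        loss += log_sum_exp - logits[i]
    return loss / len(s_norm)


def total_loss(supervised, seq_kd, enc_kd, w_seq=1.0, w_enc=1.0):
    """Weighted sum of the three training losses (weights are hypothetical)."""
    return supervised + w_seq * seq_kd + w_enc * enc_kd
```

One difference worth noting under these assumptions: the MSE loss ties the student to the exact teacher values, whereas the contrastive loss only constrains relative similarity within the batch.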
