
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

November 5, 2023
Authors: Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel
cs.AI

Abstract

In this paper, we show that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules, and they achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit neural-network weight quantization to improve efficiency. We hypothesize that it prevents quantization errors from propagating between different quantized modules, unlike recent speech transformers that mix quantized convolution and quantized self-attention modules.
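To make the efficiency mechanism concrete, the sketch below illustrates the general idea of low-bit weight quantization referenced in the abstract: mapping floating-point weights to a small signed integer range with a per-tensor scale. This is a minimal, generic sketch with assumed function names and a simple symmetric scheme; it is not the paper's actual quantization method.

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 4):
    """Symmetric per-tensor quantization of a weight matrix to `bits` bits.

    Hypothetical illustration only: maps the largest-magnitude weight to the
    top of the signed integer range and rounds the rest to the nearest level.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover a floating-point approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_weights(w, bits=4)
w_hat = dequantize_weights(q, scale)
# Rounding error is bounded by half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Each quantized module introduces an error of this kind; the abstract's hypothesis is that an encoder built from self-attention alone avoids compounding such errors across heterogeneous (convolution plus attention) quantized modules.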