

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

November 5, 2023
Authors: Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel
cs.AI

Abstract

In this paper, we show that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules. They achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that we can achieve comparable efficiency with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization of the neural network to improve efficiency. We hypothesize that it prevents errors from propagating between different quantized modules, in contrast to recent speech transformers that mix quantized convolution and quantized self-attention modules.
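The abstract contrasts encoders that mix convolution with self-attention against attention-only encoders, and pairs the latter with low-bit weight quantization. The sketch below is a minimal illustration of those two ideas, not the authors' implementation: a hypothetical AttentionOnlyBlock whose linear weights are fake-quantized to 4 bits with symmetric per-output-channel scales. All names, layer sizes, and bit widths here are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): an attention-only encoder block whose
# Linear weights are fake-quantized to a low bit width.
import torch
import torch.nn as nn


def quantize_weight(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-output-channel fake quantization of a weight matrix."""
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 7 for 4-bit weights
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output row
    scale = scale.clamp(min=1e-8)
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int * scale                              # dequantized ("fake-quant") weights


class AttentionOnlyBlock(nn.Module):
    """Transformer encoder block with self-attention and an MLP, no convolution module."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


@torch.no_grad()
def quantize_module_weights(module: nn.Module, num_bits: int = 4) -> None:
    """Replace every nn.Linear weight in-place with its low-bit fake-quantized version."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            m.weight.copy_(quantize_weight(m.weight, num_bits))


block = AttentionOnlyBlock()
quantize_module_weights(block, num_bits=4)
out = block(torch.randn(2, 100, 256))  # (batch, frames, features)
print(out.shape)                       # torch.Size([2, 100, 256])
```

Under these assumptions, every weight in the block passes through the same quantizer and there is no hand-off between a quantized convolution and a quantized attention module, which is the cross-module error-propagation path the abstract hypothesizes about for encoders that mix the two.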