Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Abstract

In this paper, we show that a simple self-supervised pre-trained audio model can match the inference efficiency of more complicated pre-trained models that use speech transformer encoders. These speech transformers mix convolutional modules with self-attention modules, and they achieve state-of-the-art ASR (automatic speech recognition) performance with top efficiency. We first show that employing these speech transformers as the encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization of the neural network to improve efficiency. We hypothesize that, compared with recent speech transformers that mix quantized convolution and quantized self-attention modules, the attention-only design prevents errors from propagating between different quantized modules.
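As a rough illustration of what low-bit weight quantization involves, the following minimal sketch applies symmetric uniform quantization to a list of weights and measures the reconstruction error at several bit widths. The abstract does not specify the quantization scheme, so the per-tensor symmetric grid, the function names (`quantize_weights`, `dequantize`, `mean_abs_error`), and the random Gaussian weights are all assumptions made for illustration only, not the authors' method.

```python
import random


def quantize_weights(weights, bits):
    """Symmetric uniform quantization of a flat list of weights.

    Assumed generic scheme (not taken from the paper): one scale per
    tensor, signed integer codes in [-qmax, qmax].
    Returns (integer codes, scale).
    """
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = quantize_weights(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical weights: 1000 samples from a standard normal distribution.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Reconstruction error at 8, 4, and 2 bits.
errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```

In this toy setting the error grows as the bit width shrinks. That is why the way quantization errors from one module feed into the next matters at low bit widths, and it is the concern behind the hypothesis stated in the abstract.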
