Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Abstract

In this paper, we show that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to that of more complicated pre-trained models with speech transformer encoders. These speech transformers mix convolutional modules with self-attention modules, and they achieve state-of-the-art ASR (automatic speech recognition) performance with top efficiency. We first show that employing these speech transformers as the encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization of the neural network to improve efficiency. We hypothesize that it avoids the propagation of errors between different quantized modules, which occurs in recent speech transformers that mix quantized convolution modules with quantized self-attention modules.
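The abstract does not say which low-bit weight quantization scheme is used, so the following is only a minimal sketch of one common choice, symmetric uniform quantization, to illustrate how the rounding error grows as the bit width shrinks. The function names `quantize_weights` and `mean_abs_error` and the random Gaussian weights are assumptions made for illustration and are not taken from the paper.

```python
import random


def quantize_weights(weights, bits):
    """Symmetric uniform quantization (illustrative assumption, not the paper's method).

    Maps each float to one of the signed integer levels -qmax..qmax,
    then maps it back to a float so the rounding error can be measured.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # nothing to scale
    scale = max_abs / qmax
    # Round to the nearest level, clamp to the valid range, then dequantize.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original, quantized):
    """Average absolute difference between original and quantized weights."""
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


# Hypothetical weights: random values standing in for one layer of a model.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

for bits in (8, 4, 2):
    err = mean_abs_error(weights, quantize_weights(weights, bits))
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```

In this sketch the error is measured for a single weight vector in isolation. The hypothesis in the abstract concerns how such errors propagate between different quantized modules, which this toy example does not model.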
