Accelerating Transducers through Adjacent Token Merging

Abstract

Recent end-to-end automatic speech recognition (ASR) systems often use a Transformer-based acoustic encoder that generates embeddings at a high frame rate. However, this design is inefficient, particularly for long speech signals, because the cost of self-attention grows quadratically with sequence length. To address this, we propose Adjacent Token Merging (A-ToMe), a method that gradually combines adjacent tokens whose key values have high similarity scores. This reduces the total number of time steps and accelerates inference in both the encoder and the joint network. Experiments on LibriSpeech show that our method can reduce the number of tokens by 57% and improve inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective way to reduce tokens in long-form ASR, where the input speech consists of multiple utterances.
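The core idea, merging neighbouring tokens whose keys are similar, can be illustrated with a small sketch. This is not the authors' implementation: the greedy left-to-right pairing, the cosine similarity measure, the averaging of merged pairs, the `threshold` value, and the function names `cosine` and `merge_adjacent_tokens` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_adjacent_tokens(tokens, keys, threshold=0.9):
    """Greedily merge neighbouring tokens whose keys are similar.

    tokens, keys: lists of equal length, one vector per time step.
    Returns (merged_tokens, merged_keys). A merged pair is replaced by its
    element-wise average, and each token takes part in at most one merge
    per call (an assumption of this sketch).
    """
    merged_tokens, merged_keys = [], []
    i = 0
    while i < len(tokens):
        has_next = i + 1 < len(tokens)
        if has_next and cosine(keys[i], keys[i + 1]) >= threshold:
            # Similar neighbours: collapse the pair into one averaged token.
            merged_tokens.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])])
            merged_keys.append([(x + y) / 2 for x, y in zip(keys[i], keys[i + 1])])
            i += 2
        else:
            # No similar right-hand neighbour: keep the token unchanged.
            merged_tokens.append(list(tokens[i]))
            merged_keys.append(list(keys[i]))
            i += 1
    return merged_tokens, merged_keys


# Toy input: five time steps where the first two and the next two are identical.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
merged, merged_keys = merge_adjacent_tokens(frames, frames)
```

In this toy input the sequence shrinks from five time steps to three. Calling such a function once per encoder layer would shorten the sequence progressively, which is one way to read "gradually" in the abstract.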
