Accelerating Transducers through Adjacent Token Merging

Abstract

Recent end-to-end automatic speech recognition (ASR) systems often use a Transformer-based acoustic encoder that generates embeddings at a high frame rate. However, this design is inefficient, particularly for long speech signals, because the cost of self-attention grows quadratically with sequence length. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. This reduces the total number of time steps and accelerates inference in both the encoder and the joint network. Experiments on LibriSpeech show that our method can reduce the number of tokens by 57% and improve inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution for reducing tokens in long-form ASR, where the input speech consists of multiple utterances.
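The merging idea described in the abstract can be illustrated with a minimal sketch. The abstract only states that adjacent tokens with high key similarity are combined; the choice of cosine similarity, the fixed `threshold`, the greedy left-to-right pairing, the merge by averaging, and the function names `cosine` and `merge_adjacent` are all assumptions made here for illustration, not details confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def average(u, v):
    """Element-wise mean of two equal-length vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]


def merge_adjacent(tokens, keys, threshold=0.9):
    """Merge neighbouring tokens whose key vectors are similar.

    tokens, keys: lists of equal length, one vector (list of floats) per time step.
    threshold: hypothetical similarity cut-off (an assumption, not from the source).
    Returns (merged_tokens, merged_keys); each token joins at most one merge per call.
    """
    if len(tokens) != len(keys):
        raise ValueError("tokens and keys must have the same length")
    out_tokens, out_keys = [], []
    i = 0
    while i < len(tokens):
        has_next = i + 1 < len(tokens)
        if has_next and cosine(keys[i], keys[i + 1]) > threshold:
            # Assumed merge rule: replace the pair with its average.
            out_tokens.append(average(tokens[i], tokens[i + 1]))
            out_keys.append(average(keys[i], keys[i + 1]))
            i += 2
        else:
            out_tokens.append(tokens[i])
            out_keys.append(keys[i])
            i += 1
    return out_tokens, out_keys


# Toy input: five time steps; keys are set equal to the tokens for simplicity.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
keys = [list(t) for t in tokens]
merged, merged_keys = merge_adjacent(tokens, keys)
print(len(tokens), "->", len(merged))  # prints "5 -> 3"
```

Calling such a function once per encoder layer would shorten the sequence step by step, which is one way to read "gradually" in the abstract. Because self-attention cost grows quadratically with sequence length, each reduction in time steps lowers the cost of every later layer.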
