Accelerating Transducers through Adjacent Token Merging
June 28, 2023
Authors: Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
cs.AI
Abstract
Recent end-to-end automatic speech recognition (ASR) systems often use a
Transformer-based acoustic encoder that generates embeddings at a high frame
rate. However, this design is inefficient for long speech signals because of
the quadratic cost of self-attention. To address this, we propose a new method,
Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens whose
key values have high similarity scores. In this way, the total number of time
steps is reduced, and inference of both the encoder and the joint network is
accelerated. Experiments on LibriSpeech show that our method reduces the number
of tokens by 57% and improves inference speed on GPU by 70% without any notable
loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective
solution for reducing tokens in long-form ASR, where the input speech consists
of multiple utterances.
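To make the core idea concrete, here is a minimal single-pass sketch of merging adjacent tokens by key similarity. This is an illustration, not the paper's implementation: the function name, the fixed `merge_ratio` knob, and the choice of averaging merged embeddings are all assumptions, and the actual A-ToMe applies merging gradually inside the encoder layers rather than in one post-hoc pass.

```python
import numpy as np

def adjacent_token_merge(tokens, keys, merge_ratio=0.5):
    """Illustrative adjacent token merging (not the paper's exact algorithm).

    tokens: (T, D) array of encoder embeddings.
    keys:   (T, D) array of self-attention key vectors for the same frames.
    merge_ratio: hypothetical knob; fraction of adjacent pairs to merge.
    Returns a shorter (T', D) sequence where merged pairs are averaged.
    """
    T = tokens.shape[0]
    # Cosine similarity between each token's key and its right neighbor's key.
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = (k[:-1] * k[1:]).sum(axis=1)          # shape (T-1,)
    # Consider non-overlapping adjacent pairs (0,1), (2,3), ...
    pair_starts = np.arange(0, T - 1, 2)
    pair_sim = sim[pair_starts]
    # Merge the most similar pairs.
    n_merge = int(len(pair_starts) * merge_ratio)
    merge_set = set(pair_starts[np.argsort(-pair_sim)[:n_merge]].tolist())
    out = []
    i = 0
    while i < T:
        if i in merge_set:
            # Average the two embeddings; each merge removes one time step.
            out.append((tokens[i] + tokens[i + 1]) / 2)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return np.stack(out)
```

Each merge removes one time step, so merging 57% of tokens (as reported on LibriSpeech) shrinks the sequence the encoder's later layers and the joint network must process, which is where the speedup comes from.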