Accelerating Transducers through Adjacent Token Merging

June 28, 2023
Authors: Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
cs.AI

Abstract

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embeddings at a high frame rate. However, this design is inefficient, particularly for long speech signals, due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. In this way, the total number of time steps can be reduced, and the inference of both the encoder and joint network is accelerated. Experiments on LibriSpeech show that our method can reduce the number of tokens by 57% and improve the inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution for reducing tokens in long-form ASR, where the input speech consists of multiple utterances.
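The core idea of merging adjacent tokens with high similarity can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: it pairs non-overlapping adjacent frame embeddings, uses cosine similarity between the embeddings themselves (rather than attention keys), and merges a pair by averaging when the similarity exceeds a hypothetical threshold.

```python
import numpy as np

def adjacent_token_merge(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Merge adjacent tokens whose cosine similarity exceeds `threshold`.

    A minimal sketch of the A-ToMe idea: adjacent, non-overlapping pairs of
    frame embeddings with high similarity are averaged into a single token,
    shortening the sequence. The single-pass pairing, the cosine measure on
    raw embeddings, and the threshold value are illustrative assumptions.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Cosine similarity between the two adjacent embeddings.
            sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
            if sim > threshold:
                merged.append((a + b) / 2.0)  # collapse the pair into one token
                i += 2
                continue
        merged.append(tokens[i])  # keep dissimilar tokens unchanged
        i += 1
    return np.stack(merged)
```

Applied repeatedly across encoder layers, such a step would gradually shorten the sequence, which is what reduces the quadratic self-attention cost and speeds up the joint network.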