Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
October 26, 2025
Authors: Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently advanced auditory speech
recognition (ASR), visual speech recognition (VSR), and audio-visual speech
recognition (AVSR). However, understanding of their internal dynamics under
fine-tuning remains limited. In natural language processing, recent work has
revealed attention sinks, tokens that attract disproportionately high
attention, and associated massive activations in which some features of sink
tokens exhibit huge activation in LLMs. In this work, we are the first to study
these phenomena in multimodal speech recognition. Through a detailed analysis
of audio-visual LLMs, we identify attention sinks and massive activations not
only at the BOS token but also at intermediate low-semantic tokens across ASR,
VSR, and AVSR. We show that massive activations originate in the MLP layers and
correspond to fixed feature indices across all sink tokens. We further show
that intermediate sink tokens exhibit high cosine similarity to the BOS token,
thereby amplifying attention and activation. Building on these insights, we
introduce a simple decorrelation loss that reduces cosine similarity between
BOS and other tokens, effectively mitigating intermediate sinks and massive
activations. Furthermore, our method improves word error rate (WER) under high
audio-visual feature downsampling while remaining stable at lower downsampling
rates.
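The decorrelation loss described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the choice of squared cosine similarity, and the assumption that the loss is applied to a single layer's hidden states are all assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(hidden: torch.Tensor, bos_index: int = 0) -> torch.Tensor:
    """Hypothetical sketch: penalize cosine similarity between the BOS
    token's hidden state and all subsequent token hidden states, so that
    intermediate tokens are discouraged from aligning with the BOS sink.

    hidden: (batch, seq_len, dim) hidden states from an LLM layer.
    """
    bos = hidden[:, bos_index:bos_index + 1, :]       # (B, 1, D)
    others = hidden[:, bos_index + 1:, :]             # (B, T-1, D)
    # Cosine similarity of each token to BOS, broadcast over the sequence.
    cos = F.cosine_similarity(others, bos, dim=-1)    # (B, T-1)
    # Squared similarity is minimized toward zero (decorrelation).
    return cos.pow(2).mean()
```

In training, this term would be added to the standard recognition loss with a weighting coefficient; the weighting and the layers it applies to are design choices not specified in the abstract.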