
Cross-Attention is Half Explanation in Speech-to-Text Models

September 22, 2025
Authors: Sara Papi, Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli
cs.AI

Abstract

Cross-attention is a core mechanism in encoder-decoder architectures, widespread in many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications (such as timestamp estimation and audio-text alignment) under the assumption that they reflect the dependencies between the input speech representation and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored within the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, in the best case, only partially reflects how the decoder attends to the encoder's representations, accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
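The comparison the abstract describes (aggregating cross-attention scores across heads and layers, then measuring how well they align with saliency maps) can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the function names, tensor layout, and the choice of Pearson correlation as the alignment measure are all assumptions made for the example.

```python
import numpy as np

def aggregate_cross_attention(attn):
    # attn: array of shape (layers, heads, target_len, source_len) holding
    # cross-attention scores from a decoder. The abstract reports that
    # alignment with saliency is strongest when scores are aggregated
    # across heads and layers; here we simply average over both axes.
    return attn.mean(axis=(0, 1))  # -> (target_len, source_len)

def alignment_score(attn_agg, saliency):
    # Pearson correlation between the flattened attention map and a
    # feature-attribution saliency map of the same shape -- one simple
    # way to quantify "moderate to strong" alignment.
    a = attn_agg.ravel().astype(float)
    s = saliency.ravel().astype(float)
    a = (a - a.mean()) / a.std()
    s = (s - s.mean()) / s.std()
    return float((a * s).mean())
```

A saliency map here would come from any feature-attribution method over the input speech features; correlating it per decoding step (rows of the aggregated matrix) rather than globally is an equally plausible variant.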