Cross-Attention is Half Explanation in Speech-to-Text Models

September 22, 2025
Authors: Sara Papi, Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli
cs.AI

Abstract

Cross-attention is a core mechanism in encoder-decoder architectures, widespread in many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications, such as timestamp estimation and audio-text alignment, under the assumption that they reflect the dependencies between the input speech representation and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored within the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, in the best case, only partially reflects how the decoder attends to the encoder's representations, accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
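
To make the comparison concrete, here is a minimal, illustrative Python sketch, not the authors' code: it uses synthetic tensors in place of real model outputs, assumes cross-attention scores of shape (layers, heads, target_len, source_len) and a saliency map of shape (target_len, source_len) from some feature-attribution method, and uses per-token Spearman rank correlation as one plausible alignment measure (the paper's exact metric is not stated in the abstract).

```python
import torch
from scipy.stats import spearmanr

# Synthetic stand-ins (hypothetical shapes, not the authors' setup):
# cross-attention scores gathered from an S2T encoder-decoder model and
# a saliency map obtained from a feature-attribution method.
num_layers, num_heads, tgt_len, src_len = 6, 8, 20, 100
attention = torch.rand(num_layers, num_heads, tgt_len, src_len)
attention = attention / attention.sum(dim=-1, keepdim=True)  # row-normalize, like softmax output
saliency = torch.rand(tgt_len, src_len)

# Aggregate across heads and layers; the abstract reports that alignment
# with saliency-based explanations is strongest under this aggregation.
agg_attention = attention.mean(dim=(0, 1))  # -> (target_len, source_len)

# Compare the two explanations per generated token, then average.
per_token_rho = [
    spearmanr(agg_attention[t].numpy(), saliency[t].numpy())[0]
    for t in range(tgt_len)
]
print(f"Mean Spearman correlation: {sum(per_token_rho) / len(per_token_rho):.3f}")
```

With real model outputs in place of the random tensors, the same aggregate-then-correlate pattern quantifies how much of the saliency-based explanation the cross-attention scores recover.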