
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

March 12, 2026
Authors: Umberto Cappellazzo, Stavros Petridis, Maja Pantic
cs.AI

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise, yet how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. Evaluating six models on two benchmarks across a range of SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation; that modality balance evolves during generation; that temporal alignment holds under noise; and that SNR is the dominant factor driving modality weighting. These results expose a persistent audio bias, motivating adaptive modality-weighting mechanisms and the adoption of Shapley-based attribution as a standard AVSR diagnostic.
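
Concretely, Shapley attribution treats each modality stream (audio, visual) as a player in a cooperative game, so a modality's contribution is its average marginal effect on a model-utility function over all coalitions. The sketch below is a minimal illustration of that computation, assuming a hypothetical characteristic function `toy_value` that stands in for whatever utility an AVSR model yields (e.g., negative WER or sequence log-likelihood) when only the modalities in a coalition are left unmasked; the function name, masking setup, and numbers are illustrative placeholders, not the paper's implementation.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Sequence

def shapley_values(
    players: Sequence[str],
    value_fn: Callable[[frozenset], float],
) -> dict[str, float]:
    """Exact Shapley values for a small set of players (here, modalities).

    value_fn(S) is the characteristic function: the model utility when
    only the modality streams in coalition S are presented and all
    other streams are masked out.
    """
    n = len(players)
    phi = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                # Marginal contribution of this player to coalition S.
                total += weight * (value_fn(s | {player}) - value_fn(s))
        phi[player] = total
    return phi

# Hypothetical utilities an AVSR model might produce at one SNR when one
# or both modality streams are masked. Illustrative numbers only; they
# are not results from the paper.
def toy_value(coalition: frozenset) -> float:
    utilities = {
        frozenset(): 0.05,                     # both streams masked
        frozenset({"audio"}): 0.70,            # audio only
        frozenset({"visual"}): 0.40,           # lip video only
        frozenset({"audio", "visual"}): 0.90,  # full audio-visual input
    }
    return utilities[coalition]

phi = shapley_values(["audio", "visual"], toy_value)
total = sum(phi.values())
for modality, value in phi.items():
    print(f"{modality}: phi = {value:.3f}  relative share = {value / total:.1%}")
```

With only two players, the exact formula reduces to averaging two marginal contributions per modality, and the efficiency property guarantees the values sum to v({audio, visual}) − v(∅), so their normalized shares can be read directly as relative modality contributions of the kind the abstract describes.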