Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
March 12, 2026
Authors: Umberto Cappellazzo, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Audio-Visual Speech Recognition (AVSR) fuses acoustic and visual information for robust recognition under noise, yet how models balance the two modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to quantify modality contributions in AVSR. Evaluating six models on two benchmarks across varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation; that modality balance evolves over the course of generation; that temporal alignment holds under noise; and that SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating adaptive modality-weighting mechanisms and establishing Shapley-based attribution as a standard AVSR diagnostic.
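With only two modalities acting as "players", the Shapley attribution underlying these analyses has a simple closed form: each modality's value is the average of its marginal contribution when added to the empty coalition and when added to the other modality. The sketch below is not the paper's implementation; it is a minimal illustration assuming a hypothetical characteristic function `value_fn` (e.g., a model's sequence log-likelihood, or negative WER, with the absent modalities masked out), and the `toy_v` numbers are made up for demonstration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet

# The two AVSR "players" whose contributions we attribute.
PLAYERS = ("audio", "visual")

def shapley_values(value_fn: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values over PLAYERS for a characteristic function value_fn."""
    n = len(PLAYERS)
    phi = {p: 0.0 for p in PLAYERS}
    for p in PLAYERS:
        others = [q for q in PLAYERS if q != p]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight of a coalition S: |S|! (n - |S| - 1)! / n!
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                # Marginal contribution of p when joining coalition S.
                phi[p] += w * (value_fn(s | {p}) - value_fn(s))
    return phi

# Hypothetical characteristic function standing in for a real AVSR model at
# some SNR: clean audio alone recovers most of the utterance, lipreading
# alone less, and the two together slightly more than either.
toy_v = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.80,
    frozenset({"visual"}): 0.45,
    frozenset({"audio", "visual"}): 0.90,
}

print(shapley_values(lambda s: toy_v[s]))
# -> {'audio': 0.625, 'visual': 0.275}; the two values sum to v({audio, visual}),
#    as required by the Shapley efficiency property.
```

Under this reading, the three analyses would differ mainly in what `value_fn` scores: the whole utterance (Global SHAP), each decoding step (Generative SHAP), or output tokens against input time windows (Temporal Alignment SHAP). The abstract does not specify these details, so treat this as an interpretive sketch rather than the authors' method.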