Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
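To make the idea of Shapley-based modality attribution concrete, the following is a minimal sketch, not the paper's implementation. It assumes each modality is treated as a single "player" and that a value function scores the model output when only a given subset of modalities is present; the paper may attribute at a finer granularity and aggregate per modality. The names `shapley_values`, `relative_contributions`, `toy_value`, and the numbers in `TOY_SCORES` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Standard Shapley weight for a coalition of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding player p to coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def relative_contributions(phi: Dict[str, float]) -> Dict[str, float]:
    """Normalize absolute Shapley values so they sum to 1 across modalities."""
    total = sum(abs(v) for v in phi.values())
    if total == 0:
        return {p: 0.0 for p in phi}
    return {p: abs(v) / total for p, v in phi.items()}


# Made-up scores for each subset of modalities (illustration only,
# not results from the paper).
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.3,
    frozenset({"audio", "video"}): 1.0,
}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical value function: looks up the toy score for a coalition."""
    return TOY_SCORES[frozenset(coalition)]


phi = shapley_values(["audio", "video"], toy_value)
shares = relative_contributions(phi)
print(phi)
print(shares)
```

With two players the enumeration reduces to averaging each modality's marginal gain over the two possible orderings, and the resulting values sum to the gap between the full-input score and the empty-input score. In a real setting the value function would have to be defined by masking or ablating a modality in an actual AVSR model, which this sketch does not attempt.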
