Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. We introduce three analyses and apply them to six models across two benchmarks and varying SNR levels: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
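As a rough illustration of how Shapley values can split a model's output between modalities, the sketch below computes exact Shapley values for a two-player game whose players are "audio" and "video". It is a minimal sketch under stated assumptions, not the Dr. SHAP-AV implementation: the abstract does not specify the feature granularity or value function, so the helper `shapley_values` and the scores in `toy_scores` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal gain from adding player p to coalition S
                phi[p] += weight * (value_fn(s | {p}) - value_fn(s))
    return phi


# Hypothetical model scores for each subset of available modalities
# (assumed values for illustration only, not results from the paper).
toy_scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.3,
    frozenset({"audio", "video"}): 1.0,
}

phi = shapley_values(["audio", "video"], toy_scores.__getitem__)

# Normalize absolute attributions into relative modality shares.
total = sum(abs(v) for v in phi.values())
shares = {m: abs(v) / total for m, v in phi.items()}
print(phi, shares)
```

With these toy scores, the two attributions sum to the difference between the full-input score and the empty-input score, which is the efficiency property that makes Shapley values suitable for comparing relative modality contributions. With many input features instead of two players, exact enumeration becomes intractable and sampling-based approximations are typically used.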
