See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
March 18, 2025
Authors: JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, YoungBin Kim
cs.AI
Abstract
Vision-language (VL) models have demonstrated strong performance across
various tasks. However, these models often rely on a specific modality for
predictions, leading to "dominant modality bias." This bias significantly
hurts performance, especially when one modality is impaired. In this study, we
analyze model behavior under dominant modality bias and theoretically show that
unaligned gradients or differences in gradient magnitudes prevent balanced
convergence of the loss. Based on these findings, we propose a novel framework,
BalGrad, to mitigate dominant modality bias. Our approach includes
inter-modality gradient reweighting, adjusting the gradient of KL divergence
based on each modality's contribution, and inter-task gradient projection to
align task directions in a non-conflicting manner. Experiments on UPMC
Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively
alleviates over-reliance on specific modalities when making predictions.
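The abstract does not give BalGrad's exact update rule, but the general idea of "inter-task gradient projection in a non-conflicting manner" can be illustrated with a rough, hypothetical sketch in the style of PCGrad-like projection (not the authors' implementation): when two task gradients point in opposing directions, the component of one that conflicts with the other is projected away before combining them.

```python
import numpy as np

def project_conflicting(g_primary, g_secondary):
    """Combine two task gradients non-conflictingly (illustrative sketch).

    If g_secondary conflicts with g_primary (negative dot product),
    its component along g_primary is removed before summation, so the
    combined update never opposes the primary task's direction.
    """
    dot = np.dot(g_primary, g_secondary)
    if dot < 0:
        # Subtract the projection of g_secondary onto g_primary.
        g_secondary = g_secondary - (dot / np.dot(g_primary, g_primary)) * g_primary
    return g_primary + g_secondary

# Conflicting case: the opposing component along g_primary is removed.
g_img = np.array([1.0, 0.0])   # hypothetical image-modality gradient
g_txt = np.array([-1.0, 1.0])  # hypothetical text-modality gradient
combined = project_conflicting(g_img, g_txt)  # -> array([1., 1.])
```

In practice such projections are applied to flattened parameter gradients at each optimization step; BalGrad additionally reweights the KL-divergence gradient by each modality's contribution, a step not shown in this sketch.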