See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
March 18, 2025
Authors: JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, YoungBin Kim
cs.AI
Abstract
Vision-language (VL) models have demonstrated strong performance across
various tasks. However, these models often rely on a specific modality for
predictions, leading to "dominant modality bias." This bias significantly
hurts performance, especially when one modality is impaired. In this study, we
analyze model behavior under dominant modality bias and theoretically show that
unaligned gradients or differences in gradient magnitudes prevent balanced
convergence of the loss. Based on these findings, we propose a novel framework,
BalGrad, to mitigate dominant modality bias. Our approach includes
inter-modality gradient reweighting, adjusting the gradient of KL divergence
based on each modality's contribution, and inter-task gradient projection to
align task directions in a non-conflicting manner. Experiments on UPMC
Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively
alleviates over-reliance on specific modalities when making predictions.
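The abstract does not give BalGrad's exact update rule, but the general idea of "inter-task gradient projection in a non-conflicting manner" can be illustrated with a rough, hypothetical sketch in the style of PCGrad-like projection (not the authors' implementation): when two task gradients point in opposing directions, the component of one that conflicts with the other is projected away before combining them.

```python
import numpy as np

def project_conflicting(g_primary, g_secondary):
    """Combine two task gradients non-conflictingly (illustrative sketch).

    If g_secondary conflicts with g_primary (negative dot product),
    its component along g_primary is removed before summation, so the
    combined update never opposes the primary task's direction.
    """
    dot = np.dot(g_primary, g_secondary)
    if dot < 0:
        # Subtract the projection of g_secondary onto g_primary.
        g_secondary = g_secondary - (dot / np.dot(g_primary, g_primary)) * g_primary
    return g_primary + g_secondary

# Conflicting case: the opposing component along g_primary is removed.
g_img = np.array([1.0, 0.0])   # hypothetical image-modality gradient
g_txt = np.array([-1.0, 1.0])  # hypothetical text-modality gradient
combined = project_conflicting(g_img, g_txt)  # -> array([1., 1.])
```

In practice such projections are applied to flattened parameter gradients at each optimization step; BalGrad additionally reweights the KL-divergence gradient by each modality's contribution, a step not shown in this sketch.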