See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
March 18, 2025
Authors: JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, YoungBin Kim
cs.AI
Abstract
Vision-language (VL) models have demonstrated strong performance across
various tasks. However, these models often rely on a specific modality for
predictions, leading to "dominant modality bias." This bias significantly
hurts performance, especially when one modality is impaired. In this study, we
analyze model behavior under dominant modality bias and theoretically show that
unaligned gradients or differences in gradient magnitudes prevent balanced
convergence of the loss. Based on these findings, we propose a novel framework,
BalGrad, to mitigate dominant modality bias. Our approach includes
inter-modality gradient reweighting, adjusting the gradient of KL divergence
based on each modality's contribution, and inter-task gradient projection to
align task directions in a non-conflicting manner. Experiments on UPMC
Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively
alleviates over-reliance on specific modalities when making predictions.
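
The abstract does not spell out the projection rule, so the following is only a rough illustration of what aligning gradients "in a non-conflicting manner" can look like. It uses a generic PCGrad-style projection, which is an assumption here and not necessarily BalGrad's formulation: when one gradient points against a reference gradient (negative dot product), the opposing component is removed. The function names `dot` and `project_non_conflicting` are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_non_conflicting(g_task: List[float], g_ref: List[float]) -> List[float]:
    """Return g_task with any component that opposes g_ref removed.

    Assumption: a generic PCGrad-style rule, used only for illustration.
    If the two gradients do not conflict (dot product >= 0), g_task is
    returned unchanged.
    """
    d = dot(g_task, g_ref)
    ref_sq = dot(g_ref, g_ref)
    if d >= 0.0 or ref_sq == 0.0:
        return list(g_task)
    scale = d / ref_sq  # negative when the gradients conflict
    # Subtract the projection of g_task onto g_ref.
    return [x - scale * y for x, y in zip(g_task, g_ref)]


if __name__ == "__main__":
    conflicting = project_non_conflicting([1.0, -1.0], [0.0, 1.0])
    # After projection the result no longer opposes the reference gradient.
    print(conflicting, dot(conflicting, [0.0, 1.0]))
```

After projection, the dot product between the returned gradient and the reference is never negative, so a step along one no longer increases the loss associated with the other to first order.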