See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

Abstract

Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to "dominant modality bias." This bias significantly degrades performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach includes inter-modality gradient reweighting, adjusting the gradient of the KL divergence based on each modality's contribution, and inter-task gradient projection to align task directions in a non-conflicting manner. Experiments on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
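
The abstract names two gradient-level ideas, reweighting and non-conflicting projection, without giving formulas. The following is a minimal sketch of the generic forms of these two operations, not BalGrad's actual update rule. The function names, the inverse-norm weighting, and the toy two-dimensional gradients are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))


def project_out_conflict(g, ref):
    """Hypothetical helper: if g conflicts with ref (negative inner
    product), remove the component of g along ref; otherwise return
    g unchanged. This is a generic projection, assumed for illustration."""
    d = dot(g, ref)
    if d >= 0:
        return list(g)
    scale = d / dot(ref, ref)
    return [a - scale * b for a, b in zip(g, ref)]


def reweight_by_magnitude(g_a, g_b, eps=1e-12):
    """Hypothetical helper: weights inversely related to gradient norms,
    so the modality with the smaller gradient gets the larger weight.
    This weighting rule is an assumption, not taken from the paper."""
    n_a, n_b = norm(g_a), norm(g_b)
    total = n_a + n_b + eps
    return n_b / total, n_a / total


# Toy gradients (made-up values): the image gradient is larger and
# points in a direction that conflicts with the text gradient.
g_img = [3.0, 0.0]
g_txt = [-1.0, 1.0]

g_txt_proj = project_out_conflict(g_txt, g_img)
w_img, w_txt = reweight_by_magnitude(g_img, g_txt)
```

After projection, `g_txt_proj` has no component opposing `g_img`, and the smaller text gradient receives the larger weight. This illustrates the abstract's claim that unaligned gradients and differing gradient magnitudes are the two quantities to correct.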
