Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
December 8, 2025
Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
cs.AI
Abstract
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Analogous to a differential amplifier, FDA computes both the original cross-attention and the function-word cross-attention within each attention head, and differentially subtracts the latter from the former, yielding better-aligned and more robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields average ASR drops of 18%/13%/53% with only 0.2%/0.3%/0.6% performance drops on the 3 tested models for retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
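To make the differential-subtraction idea concrete, here is a minimal sketch of a single de-attended cross-attention head, assuming a standard scaled dot-product formulation and reading "function-word cross-attention" as the attention restricted to text-side function-word tokens (a mask over the key dimension). The mask construction, the subtraction weight `lam`, and all function and tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fda_cross_attention(q, k, v, function_word_mask, lam=1.0):
    """Illustrative Function-word De-Attention for one cross-attention head.

    q:   (batch, q_len, d)   queries from one modality (e.g. image tokens)
    k,v: (batch, kv_len, d)  keys/values from the other modality (e.g. text tokens)
    function_word_mask: (batch, kv_len) bool, True where the text token is a
                        function word (e.g. "the", "of", "a"); assumed given.
    lam: weight of the differential subtraction (hypothetical knob).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, q_len, kv_len)
    attn = F.softmax(scores, dim=-1)                        # original cross-attention

    # Cross-attention restricted to function-word keys only.
    fw_scores = scores.masked_fill(~function_word_mask.unsqueeze(1), float("-inf"))
    fw_attn = F.softmax(fw_scores, dim=-1)
    fw_attn = torch.nan_to_num(fw_attn, nan=0.0)            # rows with no function words

    # Differentially subtract the function-word component, like a differential
    # amplifier cancelling a common-mode signal, then renormalize.
    de_attn = (attn - lam * fw_attn).clamp(min=0.0)
    de_attn = de_attn / de_attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return de_attn @ v
```

In this reading, the subtraction shifts attention mass away from function-word tokens toward content words before the weighted sum over values; whether the renormalization and the exact placement inside the head match the paper should be checked against the released code.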