
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

December 8, 2025
Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
cs.AI

Abstract

To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Analogous to a differential amplifier, our FDA computes both the original cross-attention and the function-word cross-attention within each attention head, and differentially subtracts the latter from the former to yield better-aligned and more robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks, on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields average ASR drops of 18/13/53% with only 0.2/0.3/0.6% performance drops on the 3 tested models for retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
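As a rough illustration of the differential mechanism the abstract describes, here is a minimal sketch of a single attention head that subtracts function-word cross-attention from the full cross-attention. This is based only on the abstract, not the released code: the function name `fda_cross_attention`, the mask construction, the subtraction weight `alpha`, and the renormalization step are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fda_cross_attention(q, k, v, function_word_mask, alpha=1.0):
    """Sketch of Function-word De-Attention (FDA) for one attention head.

    q: (batch, q_len, d)    query states (e.g., visual tokens)
    k, v: (batch, k_len, d) key/value states (e.g., text tokens)
    function_word_mask: (batch, k_len) bool, True where the text token
        is a function word (e.g., "the", "of", "a").
    alpha: hypothetical scaling factor for the differential subtraction.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # raw attention logits

    # Original cross-attention over all text tokens.
    attn_full = F.softmax(scores, dim=-1)

    # Cross-attention restricted to function-word tokens only.
    fw_scores = scores.masked_fill(~function_word_mask.unsqueeze(1), float("-inf"))
    attn_fw = F.softmax(fw_scores, dim=-1)
    # Zero out rows with no function words (all -inf would give NaNs).
    has_fw = function_word_mask.any(dim=-1, keepdim=True).unsqueeze(1)
    attn_fw = torch.where(has_fw, attn_fw, torch.zeros_like(attn_fw))

    # Differential step: subtract the function-word attention, the way a
    # differential amplifier rejects a common-mode signal, then renormalize.
    attn = (attn_full - alpha * attn_fw).clamp(min=0.0)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

    return attn @ v

# Tiny usage example with random tensors and a hand-made mask.
q = torch.randn(1, 5, 64)
k = torch.randn(1, 7, 64)
v = torch.randn(1, 7, 64)
mask = torch.tensor([[True, False, False, True, False, False, False]])
out = fda_cross_attention(q, k, v, mask)  # (1, 5, 64)
```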