Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Abstract

Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs have been overlooked. In this work, we find that the lack of noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset of aligned and misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving VLM functionality. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, which leverages diffusion models to convert adversarial perturbations into Gaussian-like noise that VLMs with noise-augmented safety fine-tuning can then defend against. Experimental results demonstrate that the distribution-shifting property of diffusion models aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
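
To make the idea of a Gaussian-noise perturbation and noise-augmented training data concrete, the following is a minimal sketch, not the authors' implementation. It treats an image as a flat list of pixel intensities in [0, 1]; the function names and the parameters `sigma_max` and `prob` are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import random


def add_gaussian_noise(pixels, sigma, rng=None):
    """Return a copy of `pixels` (floats in [0, 1]) with zero-mean Gaussian
    noise of standard deviation `sigma` added, clipped back to [0, 1]."""
    rng = rng or random.Random()
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def noise_augment(pixels, sigma_max=0.1, prob=0.5, rng=None):
    """Hypothetical augmentation step: with probability `prob`, perturb the
    image with a standard deviation drawn uniformly from [0, sigma_max];
    otherwise return an unchanged copy."""
    rng = rng or random.Random()
    if rng.random() >= prob:
        return list(pixels)
    sigma = rng.uniform(0.0, sigma_max)
    return add_gaussian_noise(pixels, sigma, rng)


if __name__ == "__main__":
    image = [0.0, 0.5, 1.0, 0.25]  # toy 4-pixel "image"
    noisy = noise_augment(image, sigma_max=0.1, prob=1.0, rng=random.Random(0))
    print(noisy)
```

In a fine-tuning pipeline, a step like `noise_augment` would presumably be applied to each training image before it reaches the vision encoder, so that the model sees both clean and noisy versions of safety-relevant inputs.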
