Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

May 1, 2025
Authors: Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
cs.AI

Abstract

Large language models (LLMs) trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs (MLLMs), as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates samples of varying proximity for testing generalization and specificity, followed by manual filtering to maintain high quality. We then evaluate six defense objectives against seven attacks (four white-box, three black-box), including a novel white-box method that leverages the interpretability of hidden states. Our results show that multimodal attacks outperform text-only or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
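
Both the strongest attack and the most effective defense described above hinge on whether the deleted answer remains decodable from intermediate hidden states. The following is a minimal, hypothetical sketch of such a hidden-state probe in the style of the logit lens, using a text-only GPT-2 as a stand-in for an MLLM; the model, prompt, and target fact are illustrative assumptions, not the paper's actual setup. If the unlearned answer's token still ranks highly at some layer, answer information survives inside the model.

```python
# Hedged sketch of a logit-lens hidden-state probe. All names here
# (model, prompt, fact) are illustrative stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # text-only stand-in; the paper targets multimodal LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The Eiffel Tower is located in"  # proxy for a supposedly deleted fact
answer_id = tokenizer.encode(" Paris")[0]  # first token of the unlearned answer

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: project each layer's last-token hidden state through the
# final layer norm and the unembedding matrix, then check how highly the
# deleted answer token ranks. A low rank at any layer means answer
# information is still present in the model's internal states.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    rank = int((logits > logits[answer_id]).sum()) + 1
    print(f"layer {layer:2d}: rank of deleted answer token = {rank}")
```

Under this framing, a defense that removes answer information from internal model states would aim to drive that rank down at every layer, not merely suppress the answer in the final output.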
