Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
May 1, 2025
Authors: Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
cs.AI
Abstract
LLMs trained on massive datasets may inadvertently acquire sensitive
information such as personal details and potentially harmful content. This risk
is further heightened in multimodal LLMs (MLLMs), as they integrate information
from multiple modalities (image and text). Adversaries can exploit this knowledge
through multimodal prompts to extract sensitive details. Evaluating how
effectively MLLMs can forget such information (targeted unlearning)
necessitates the creation of high-quality, well-annotated image-text pairs.
While prior work on unlearning has focused on text, multimodal unlearning
remains underexplored. To address this gap, we first introduce a multimodal
unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as
an attack-and-defense framework to evaluate methods for deleting specific
multimodal knowledge from MLLMs. We extend a visual question-answering dataset
using an automated pipeline that generates varying-proximity samples for
testing generalization and specificity, followed by manual filtering to
maintain high quality. We then evaluate six defense objectives against seven
attacks (four whitebox, three blackbox), including a novel whitebox method
leveraging interpretability of hidden states. Our results show multimodal
attacks outperform text- or image-only ones, and that the most effective
defense removes answer information from internal model states. Additionally,
larger models exhibit greater post-editing robustness, suggesting that scale
enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing
unlearning in MLLMs.
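To make the attack-and-defense framing concrete, the following is a minimal, hypothetical sketch of how an attack-success metric for multimodal unlearning could be computed: for each attack, check whether the supposedly deleted answer can still be elicited from the unlearned model. All names here (attack_success, unlearned_model, generate, the sample tuple layout) are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Illustrative sketch only: a generic attack-vs-defense evaluation loop for
# multimodal unlearning. Names and interfaces are hypothetical and not taken
# from the UnLOK-VQA implementation.

def attack_success(unlearned_model, attacks, samples):
    """Fraction of deleted facts each attack still recovers after unlearning.

    unlearned_model: object with a generate(image, text) -> str method (assumed).
    attacks: list of callables mapping (image, question) -> (image', question'),
             e.g. text-only, image-only, or combined multimodal prompt attacks.
    samples: iterable of (image, question, deleted_answer) triples.
    """
    results = {}
    for attack in attacks:
        hits = 0
        total = 0
        for image, question, deleted_answer in samples:
            # The attack perturbs or rephrases the multimodal prompt to try
            # to elicit the supposedly forgotten answer.
            adv_image, adv_question = attack(image, question)
            prediction = unlearned_model.generate(adv_image, adv_question)
            hits += int(deleted_answer.lower() in prediction.lower())
            total += 1
        results[attack.__name__] = hits / max(total, 1)
    return results
```

A lower attack-success rate under the strongest (multimodal) attacks indicates a more effective defense objective; specificity would be checked separately by confirming the model still answers nearby, non-deleted questions correctly.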