Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Abstract

LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is heightened in multimodal LLMs (MLLMs), which integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) requires high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), together with an attack-and-defense framework for evaluating methods that delete specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates samples of varying proximity for testing generalization and specificity, followed by manual filtering to maintain quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method that leverages the interpretability of hidden states. Our results show that multimodal attacks outperform text-only and image-only attacks, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
