M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
December 28, 2025
Authors: Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen
cs.AI
Abstract
Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
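The evaluation setup described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the names `evaluation_scenarios` and `concept_reproduction_rate` are hypothetical, and the abstract does not define CRR precisely, so the sketch assumes it is the fraction of generated images in which some concept detector still finds the erased concept.

```python
from itertools import product

# The three input modalities named in the abstract.
MODALITIES = ("text_prompt", "learned_embedding", "inverted_latent")
# Access levels, applied only to the latter two modalities.
ACCESS_LEVELS = ("white_box", "black_box")


def evaluation_scenarios():
    """Enumerate the five scenarios: text prompts, plus white-box and
    black-box variants of learned embeddings and inverted latents."""
    scenarios = [("text_prompt", None)]
    for modality, access in product(MODALITIES[1:], ACCESS_LEVELS):
        scenarios.append((modality, access))
    return scenarios


def concept_reproduction_rate(detections):
    """ASSUMED definition of CRR: the share of generated images in which
    the erased concept is detected. `detections` holds one boolean per
    generated image (True means the concept reappeared)."""
    detections = list(detections)
    if not detections:
        raise ValueError("need at least one generated image")
    return sum(bool(d) for d in detections) / len(detections)


if __name__ == "__main__":
    for modality, access in evaluation_scenarios():
        print(modality, access or "n/a")
    # Hypothetical detector outputs for four generated images.
    print(concept_reproduction_rate([True, True, False, True]))
```

Under this assumed definition, a lower CRR means stronger erasure, which matches how the abstract reports IRECE's improvement as a reduction in CRR.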