
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

March 10, 2025
Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
cs.AI

Abstract

Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods have been studied to help the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, which uses a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. First, TRCE erases the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts that carry safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Next, exploiting the deterministic properties of the diffusion model's sampling trajectory, TRCE uses contrastive learning to steer the early denoising prediction toward the safe direction and away from the unsafe one, further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
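
To make the first stage more concrete, here is a minimal sketch of one way a cross-attention projection could be edited so that "malicious" embeddings produce the output the original layer gave for "safe" ones, while other embeddings are left nearly unchanged. It is an assumption-laden illustration, not TRCE's actual objective: it uses a generic ridge-regularized least-squares edit with a closed-form solution, and the function name `remap_projection`, its parameters, and the random vectors standing in for [EoT] embeddings are all hypothetical.

```python
import numpy as np

def remap_projection(W_old, src, dst, keep, preserve_weight=1.0, ridge=1e-3):
    """Hypothetical closed-form edit of one projection matrix.

    W_old: (out_dim, d) original key/value projection.
    src:   (n, d) embeddings to erase (stand-ins for malicious [EoT] embeddings).
    dst:   (n, d) embeddings to map them to (stand-ins for safe counterparts).
    keep:  (m, d) embeddings whose outputs should stay unchanged.
    Minimizes  sum ||W src_i - W_old dst_i||^2
             + preserve_weight * sum ||W keep_j - W_old keep_j||^2
             + ridge * ||W - W_old||_F^2.
    """
    d = W_old.shape[1]
    target_erase = dst @ W_old.T    # (n, out_dim): outputs the safe prompts produce
    target_keep = keep @ W_old.T    # (m, out_dim): outputs to preserve
    # Normal equations: W @ B = A, with B symmetric positive definite.
    A = target_erase.T @ src + preserve_weight * (target_keep.T @ keep) + ridge * W_old
    B = src.T @ src + preserve_weight * (keep.T @ keep) + ridge * np.eye(d)
    return np.linalg.solve(B, A.T).T

# Toy usage with random vectors in place of real text-encoder embeddings.
rng = np.random.default_rng(0)
W_old = rng.normal(size=(8, 16))
src, dst, keep = rng.normal(size=(2, 16)), rng.normal(size=(2, 16)), rng.normal(size=(4, 16))
W_new = remap_projection(W_old, src, dst, keep)
```

After the edit, `W_new @ src[i]` should lie close to `W_old @ dst[i]`, and `W_new @ keep[j]` close to `W_old @ keep[j]`. The ridge term keeps the change to the weights small, which is one simple way to trade off erasure against preservation.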
