

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

January 29, 2025
Authors: Bartosz Cywiński, Kamil Deja
cs.AI

Abstract

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
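
To make the intervention concrete, here is a minimal, illustrative sketch of SAE-based feature ablation on diffusion-model activations. This is not the authors' implementation: the `SparseAutoencoder` class, the `ablate_concept` helper, the `concept_feature_ids` argument, and the `scale` multiplier are hypothetical names chosen for the example, and the reconstruction-error pass-through follows common SAE-intervention practice rather than anything stated in the abstract.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete linear dictionary with
    ReLU-induced sparsity on the hidden code."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


def ablate_concept(
    activations: torch.Tensor,          # (batch, d_model) diffusion activations
    sae: SparseAutoencoder,
    concept_feature_ids: torch.Tensor,  # indices of features tied to the concept
    scale: float = 0.0,                 # 0.0 zeroes the features entirely
) -> torch.Tensor:
    """Encode activations, damp concept-specific features, decode back.

    The reconstruction error of the unmodified pass is added back so that
    whatever the SAE fails to capture is left untouched, helping preserve
    overall model performance.
    """
    f = sae.encode(activations)
    error = activations - sae.decode(f)      # residual the SAE cannot explain
    mask = torch.ones(f.shape[-1], device=f.device)
    mask[concept_feature_ids] = scale        # intervene only on selected features
    return sae.decode(f * mask) + error
```

In the full method, which features to target is decided by the paper's feature-selection procedure across denoising timesteps; in this sketch the indices are simply supplied by the caller.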