
Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations

January 31, 2025
Authors: Dahye Kim, Deepti Ghadiyaram
cs.AI

Abstract

Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the base model or LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is ~5x faster than the current state of the art.
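The steering mechanism the abstract describes operates on text embeddings: a k-SAE encodes the embedding into a sparse latent space where individual units correspond to monosemantic concepts, and adjusting one unit before decoding shifts the prompt representation toward or away from that concept. Below is a minimal PyTorch sketch of that idea, assuming a trained k-SAE; the class, the top-k activation rule, and the `concept_idx`/`scale` interface are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (assumed names, not the authors' released API) of concept
# steering with a k-sparse autoencoder over text embeddings.
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    """k-SAE: keeps only the k largest latent activations per token."""
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Zero out all but the top-k activations along the latent dimension.
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        return z * mask

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

def steer(text_emb: torch.Tensor, sae: KSparseAutoencoder,
          concept_idx: int, scale: float) -> torch.Tensor:
    """Shift a text embedding along one monosemantic k-SAE latent.

    scale < 0 pushes the generation away from the concept (e.g., removing
    nudity); scale > 0 amplifies it (e.g., adding a photographic style).
    """
    z = sae.encode(text_emb)
    recon = sae.decode(z)
    z_steered = z.clone()
    z_steered[..., concept_idx] += scale
    # Apply only the reconstruction delta to the original embedding, so the
    # SAE's residual error is left untouched.
    return text_emb + sae.decode(z_steered) - recon

# Hypothetical usage: dampen latent unit 123 in a CLIP text embedding of
# shape (batch, seq_len, 768) before feeding it to the diffusion model.
# sae = KSparseAutoencoder(d_model=768, d_latent=16384, k=32)
# steered_emb = steer(prompt_emb, sae, concept_idx=123, scale=-4.0)
```

Because steering is a single encode/edit/decode pass over the prompt embedding, it adds negligible cost per generation and requires no change to the diffusion model's weights, which is consistent with the efficiency claims in the abstract.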
