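Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations

Abstract

Despite remarkable progress in text-to-image generative models, they remain vulnerable to adversarial attacks and can inadvertently generate unsafe, unethical content. Existing approaches typically fine-tune the model to remove specific concepts, which is computationally expensive, scales poorly, and can degrade generation quality. In this work, we propose a novel framework that leverages k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings, then use them to steer generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we show that the approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise generation quality, and is robust to adversarial prompt manipulations. The method yields a 20.01% improvement in unsafe concept removal, is effective for style manipulation, and is approximately 5x faster than the current state of the art.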

English

Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is approximately 5x faster than the current state of the art.
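
To make the idea of a k-sparse autoencoder over text embeddings more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`top_k_sparse`, `encode`, `decode`, `steer`), the toy weights, and the choice to steer by adding a scaled decoder row to the embedding are all assumptions made for illustration; the abstract does not specify how the steering step is carried out.

```python
# Illustrative sketch only: a toy k-sparse autoencoder and a steering step.
# All names and the steering rule are hypothetical, not taken from the paper.

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def top_k_sparse(values, k):
    """Keep the k largest activations and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def encode(embedding, w_enc, b_enc, k):
    """Map an embedding to a latent code with at most k non-zero entries."""
    pre = [a + b for a, b in zip(matvec(w_enc, embedding), b_enc)]
    return top_k_sparse(pre, k)

def decode(latent, w_dec, b_dec):
    """Reconstruct the embedding; row i of w_dec is the direction of latent i."""
    out = list(b_dec)
    for z, row in zip(latent, w_dec):
        if z != 0.0:
            for j in range(len(out)):
                out[j] += z * row[j]
    return out

def steer(embedding, w_dec, concept_index, strength):
    """Shift an embedding along one latent's decoder direction (assumed rule).

    A negative strength moves away from the concept, a positive one towards it.
    """
    direction = w_dec[concept_index]
    return [e + strength * d for e, d in zip(embedding, direction)]

if __name__ == "__main__":
    # Toy 3-dimensional setup with identity weights, purely for demonstration.
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    zeros = [0.0, 0.0, 0.0]
    embedding = [1.0, 3.0, 2.0]
    latent = encode(embedding, identity, zeros, k=1)
    print("sparse latent:", latent)
    print("reconstruction:", decode(latent, identity, zeros))
    print("steered away from latent 1:", steer(embedding, identity, 1, -2.0))
```

In a real setting the encoder and decoder weights would be learned from text-encoder activations, and the steered embedding would be passed to the diffusion model in place of the original; both of those steps are outside the scope of this sketch.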
