SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder

Abstract

Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. Edits are applied by moving the embeddings along carefully chosen directions, and the size of the move controls the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model-agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.
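
The token-level edit described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `edit_token_embedding`, the random stand-in for an SAE decoder, the choice of latent index, and the unit-normalization of the direction are all assumptions made for illustration. It only shows the general idea of shifting one token's embedding along a fixed direction, with a scalar strength giving continuous control while the other tokens stay untouched.

```python
import math
import random


def edit_token_embedding(embeddings, token_index, direction, strength):
    """Return a copy of `embeddings` with one token shifted along `direction`.

    embeddings: list of per-token vectors (list of lists of floats).
    direction:  vector of the same dimension (assumed here to come from an SAE).
    strength:   scalar edit strength; 0.0 leaves the embedding unchanged.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Copy every row so the input embeddings are not modified in place.
    edited = [row[:] for row in embeddings]
    # Move only the chosen token; the shift length equals `strength`.
    edited[token_index] = [
        e + strength * d / norm for e, d in zip(edited[token_index], direction)
    ]
    return edited


if __name__ == "__main__":
    random.seed(0)
    num_tokens, dim, num_latents = 6, 8, 32
    # Hypothetical stand-ins: random prompt embeddings and a random "SAE decoder".
    embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    decoder = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]
    direction = decoder[5]  # assumption: latent 5 corresponds to the target attribute

    for strength in (0.0, 0.5, 1.0):
        edited = edit_token_embedding(embeddings, 3, direction, strength)
        shift = math.dist(edited[3], embeddings[3])
        print(f"strength={strength:.1f} shift={shift:.3f}")
```

In a real pipeline the edited embeddings would be passed to the image generator in place of the original prompt embeddings. How the direction is selected from the SAE latent space is specific to the paper and is not reproduced here.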
