SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

Abstract

Recent advances in subject-driven image generation have enabled zero-shot generation, yet precisely selecting and focusing on the crucial subject representations remains challenging. To address this, we introduce the SSR-Encoder, a novel architecture designed to selectively capture any subject from single or multiple reference images. It responds to various query modalities, including text and masks, without requiring test-time fine-tuning. The SSR-Encoder combines a Token-to-Patch Aligner, which aligns query inputs with image patches, and a Detail-Preserving Subject Encoder, which extracts and preserves fine features of the subjects; together they generate subject embeddings. These embeddings are used alongside the original text embeddings to condition the generation process. The SSR-Encoder is generalizable and efficient, and it adapts to a range of custom models and control modules. Training is further improved by an Embedding Consistency Regularization Loss. Extensive experiments demonstrate the model's effectiveness in versatile, high-quality image generation and indicate its broad applicability. Project page: https://ssr-encoder.github.io
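The abstract does not say how query inputs are aligned with image patches. As a rough illustration only, the sketch below assumes scaled dot-product attention between query token vectors and patch feature vectors, followed by attention-weighted pooling into one embedding per query token. The function names, the toy vectors, and the pooling step are all hypothetical and are not taken from the SSR-Encoder implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def token_to_patch_attention(query_tokens, patch_features):
    """Return one attention distribution over patches per query token.

    ASSUMPTION: similarity is a scaled dot product; the abstract does not
    specify the actual alignment mechanism.
    """
    scale = 1.0 / math.sqrt(len(patch_features[0]))
    return [
        softmax([dot(q, p) * scale for p in patch_features])
        for q in query_tokens
    ]


def pool_subject_embedding(attention, patch_features):
    """Attention-weighted average of patch features, one vector per token.

    ASSUMPTION: simple weighted pooling stands in for the subject encoder.
    """
    dim = len(patch_features[0])
    return [
        [sum(w * p[d] for w, p in zip(weights, patch_features)) for d in range(dim)]
        for weights in attention
    ]


# Toy data: one query token and three image patches in a 2-D feature space.
query_tokens = [[1.0, 0.0]]
patch_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attention = token_to_patch_attention(query_tokens, patch_features)
subject_embedding = pool_subject_embedding(attention, patch_features)
```

In this toy setup the query token puts most of its weight on the first patch, so the pooled embedding is dominated by that patch's features. Under the same assumption, a mask query could be handled by replacing the softmax weights with normalized mask values over the patches.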
