Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Abstract

Although subject-driven generation has been extensively explored in image generation because of its wide range of applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multi-subject ones, and then scaling them, is particularly difficult. For the second, most recent methods focus on single-subject generation, which makes them hard to apply to multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle these challenges. The pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers to generate high-consistency multi-subject paired data. We also introduce UNO, a multi-image conditioned subject-to-image model that is iteratively trained from a text-to-image model and consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
