Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Abstract

With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry must be generated to enter the target domain. The ability to translate across large domain gaps has a wide variety of real-world applications in criminology, astrology, environmental conservation, and paleontology. We introduce a new task, Skull2Animal, for translating between skulls and living animals. On this task, we find that unguided Generative Adversarial Networks (GANs) are not capable of translating across large domain gaps. Instead of these traditional I2I methods, we explore guided diffusion and image editing models and provide a new benchmark model, Revive-2I, capable of performing zero-shot I2I via text-prompted latent diffusion models. We find that guidance is necessary for longI2I because bridging the large domain gap requires prior knowledge about the target domain. In addition, we find that prompting provides the best and most scalable information about the target domain, since classifier-guided diffusion models require retraining for specific use cases and lack strong constraints on the target domain because of the wide variety of images they are trained on.
