BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Abstract

Generating higher-resolution human-centric scenes with detail and control remains a challenge for existing text-to-image diffusion models. The challenge stems from three factors: limited training image size, limited text encoder capacity (token limits), and the inherent difficulty of generating complex scenes involving multiple humans. Current methods have attempted to address only the training size limit, and they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations and generates higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged, hierarchical approach. It first generates a detailed base image that focuses on the elements crucial to instance creation for multiple humans and on detailed descriptions beyond the token limit of the diffusion model. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text- and instance-aware details. This conversion uses our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in correspondence with detailed text descriptions and in naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models, without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.
