Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Abstract

Stable Diffusion, a generative model for text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue stems primarily from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, training directly on images of unlimited size is infeasible, as it would require an immense number of text-image pairs and substantial computational expense. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size while minimizing the need for high-memory GPU resources. Specifically, the first stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of aspect ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) in the second stage. This method rapidly enlarges the ASD output to any high-resolution size while avoiding seaming artifacts and memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, halving inference time compared with the traditional tiled algorithm.
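The abstract does not say how FSTD arranges its tiles, so the following is only a rough illustration of the general idea behind tiled processing, not the authors' method. It assumes non-overlapping tiles, which keep peak memory proportional to the tile size rather than the image size, and it assumes the tile grid is shifted by a random offset at each step so that tile borders do not stay in the same place. The names `tile_boxes` and `shifted_grids`, and the random-shift strategy itself, are hypothetical.

```python
import random


def tile_boxes(height, width, tile, offset_y=0, offset_x=0):
    """Return non-overlapping (top, left, bottom, right) boxes covering the image.

    The grid is shifted up and left by the given offsets, so tiles that
    stick out past the image border are clipped to the image bounds.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    if not (0 <= offset_y < tile and 0 <= offset_x < tile):
        raise ValueError("offsets must lie in [0, tile)")
    boxes = []
    for top in range(-offset_y, height, tile):
        for left in range(-offset_x, width, tile):
            boxes.append((
                max(top, 0),
                max(left, 0),
                min(top + tile, height),
                min(left + tile, width),
            ))
    return boxes


def shifted_grids(height, width, tile, steps, seed=0):
    """Yield one tile grid per step, each with a fresh random offset.

    Assumption for illustration: moving the grid between steps keeps
    tile borders from accumulating along the same rows and columns.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        yield tile_boxes(height, width, tile,
                         rng.randrange(tile), rng.randrange(tile))


if __name__ == "__main__":
    for step, grid in enumerate(shifted_grids(8, 8, 4, steps=3)):
        print(step, grid)
```

In a real pipeline each box would be cropped from the latent, processed on its own, and written back. Because the boxes never overlap, every pixel is processed exactly once per step, which is one plausible way for a tiled pass to be faster than an overlapping one.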
