Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
August 31, 2023
Authors: Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
cs.AI
Abstract
Stable Diffusion, a generative model used in text-to-image synthesis,
frequently encounters resolution-induced composition problems when generating
images of varying sizes. This issue primarily stems from the model being
trained on pairs of single-scale images and their corresponding text
descriptions. Moreover, training directly on images of unlimited sizes is
infeasible, as it would require an immense number of text-image pairs and
entail substantial computational expense. To overcome these challenges, we
propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to
efficiently generate well-composed images of any size, while minimizing the
need for high-memory GPU resources. Specifically, the initial stage, dubbed Any
Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a
restricted range of ratios to optimize the text-conditional diffusion model,
thereby improving its ability to adjust composition to accommodate diverse
image sizes. To support the creation of images at any desired size, we further
introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the
subsequent stage. This method allows for the rapid enlargement of the ASD
output to any high-resolution size, avoiding seaming artifacts or memory
overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks
demonstrate that ASD can produce well-structured images of arbitrary sizes
while reducing inference time by 2x compared to the traditional tiled
algorithm.
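
The two ideas the abstract relies on, a restricted set of aspect ratios and tile-by-tile processing of a large image, can be illustrated with a small sketch. This is not the authors' ARAD or FSTD implementation, which the abstract does not detail. The ratio list, the default tile size, the overlap value, and the function names `nearest_ratio` and `tile_boxes` are all assumptions chosen for illustration, and the tiling shown is the conventional overlapping-tile partition that tiled approaches generally build on.

```python
import math

# Hypothetical set of height/width ratios; the paper's actual set is not
# stated in the abstract.
RATIOS = [0.5, 0.75, 1.0, 1.33, 2.0]


def nearest_ratio(height, width, ratios=RATIOS):
    """Return the candidate ratio closest to height/width, compared in log space."""
    target = math.log(height / width)
    return min(ratios, key=lambda r: abs(math.log(r) - target))


def _starts(size, tile, stride):
    """Start offsets along one axis so that tiles of length `tile` cover [0, size)."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    # Add a final tile flush with the edge if the regular grid stops short.
    if starts[-1] + tile < size:
        starts.append(size - tile)
    return starts


def tile_boxes(height, width, tile=512, overlap=64):
    """Cover a height x width image with (top, left, bottom, right) boxes.

    Each box is at most tile x tile pixels, and neighbouring boxes along an
    axis overlap by at least `overlap` pixels, so only one tile needs to be
    held in memory at a time.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    boxes = []
    for top in _starts(height, tile, stride):
        for left in _starts(width, tile, stride):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes


if __name__ == "__main__":
    print(nearest_ratio(1080, 1920))
    print(len(tile_boxes(1024, 2048)))
```

Comparing ratios in log space treats a ratio and its reciprocal symmetrically, so a 1:2 and a 2:1 image are equally far from square. The overlap between tiles is what a conventional tiled method blends to hide seams; it also means overlapping pixels are processed more than once, which is one source of the extra inference cost of such methods.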