Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Abstract

Stable Diffusion, a generative model for text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue stems primarily from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, training directly on images of unlimited sizes is infeasible, as it would require an immense number of text-image pairs and entail substantial computational expense. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size while minimizing the need for high-memory GPU resources. Specifically, the first stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of aspect ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, the second stage introduces a technique called Fast Seamless Tiled Diffusion (FSTD). This method rapidly enlarges the ASD output to any high-resolution size while avoiding seam artifacts and memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, reducing inference time by a factor of two compared with the traditional tiled algorithm.
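As a rough illustration of why tile-based processing keeps memory bounded at large output sizes, the sketch below computes a grid of non-overlapping tiles that covers a canvas of arbitrary height and width. It is not the FSTD algorithm: the abstract does not specify how FSTD partitions the image or suppresses seams, and the function name `tile_boxes` and the default tile size of 512 pixels are assumptions made for this example only.

```python
from typing import List, Tuple

# A box is (top, left, bottom, right) in pixel coordinates, bottom/right exclusive.
Box = Tuple[int, int, int, int]


def tile_boxes(height: int, width: int, tile: int = 512) -> List[Box]:
    """Cover a height x width canvas with non-overlapping tiles.

    Each tile is at most tile x tile pixels; tiles on the bottom and right
    edges are clipped to the canvas. Hypothetical helper, not from the paper.
    """
    if height <= 0 or width <= 0 or tile <= 0:
        raise ValueError("height, width and tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    # Each tile can be processed independently, so peak memory depends on
    # the tile size rather than on the full output resolution.
    for box in tile_boxes(1024, 1536):
        print(box)
```

A plain non-overlapping grid like this one tends to leave visible seams at tile borders when each tile is denoised independently, which is the kind of artifact the abstract says FSTD is designed to avoid.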
