SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

July 4, 2023
Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
cs.AI

Abstract

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
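
The two-stage sampling the abstract describes (a base model followed by a post-hoc image-to-image refinement pass) can be sketched with the Hugging Face diffusers library rather than the Stability-AI/generative-models reference code linked above. The checkpoint names, prompt, and step counts below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of SDXL's base + refiner sampling using Hugging Face diffusers.
# Assumes the publicly released SDXL 1.0 checkpoints on the Hugging Face Hub;
# the paper's own reference code lives in Stability-AI/generative-models.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"  # hypothetical prompt

# Stage 1: the base model generates a sample and returns its latent.
latents = base(prompt=prompt, num_inference_steps=40, output_type="latent").images

# Stage 2: the refiner runs a post-hoc image-to-image pass on that latent,
# sharpening high-frequency detail to improve visual fidelity.
image = refiner(prompt=prompt, image=latents, num_inference_steps=40).images[0]
image.save("sdxl_sample.png")
```

This mirrors the design choice in the abstract: the refiner is a separate model applied after the base model's output, as an image-to-image step, rather than a change to the base model itself.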