Wuerstchen: Efficient Pretraining of Text-to-Image Models

Abstract

We introduce Wuerstchen, a novel technique for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advances in machine learning, our approach applies latent diffusion at strong latent image compression rates. This significantly reduces the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, making real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, which cuts the usual costs significantly without compromising final performance. In comparisons against state-of-the-art models, we found the approach to be strongly competitive. This paper opens a new line of research that prioritizes both performance and computational accessibility, thereby democratizing the use of sophisticated AI technologies. With Wuerstchen, we demonstrate a compelling step forward in text-to-image synthesis and offer an innovative path for future research.
