Wuerstchen: Efficient Pretraining of Text-to-Image Models

Abstract

We introduce Wuerstchen, a novel technique for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advances in machine learning, our approach applies latent diffusion at strong latent image compression rates, which significantly reduces the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, making real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, which cuts the usual costs significantly without compromising final performance. In a comparison against the state of the art, we found the approach to be strongly competitive. This paper opens a new line of research that prioritizes both performance and computational accessibility, thereby democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling step forward in text-to-image synthesis and offer an innovative path to explore in future research.

