Wuerstchen: Efficient Pretraining of Text-to-Image Models

Abstract

We introduce Wuerstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to be strongly competitive. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.
