Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution is to craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Notably, Glyph-SDXL gains the ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we demonstrate a substantial improvement in scene text rendering in open-domain real images. We hope these results encourage further exploration in designing customized text encoders for diverse and challenging tasks.
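
To make the "character awareness" requirement concrete, the following is a minimal sketch of byte-level encoding in plain Python. It is an illustration only, not the Glyph-ByT5 implementation: the helper names (`byte_level_ids`, `decode_ids`, `differing_positions`) are hypothetical, and the offset of 3 is an assumption chosen to mirror the reserved special-token ids used by the public ByT5 tokenizer. The point it shows is that every character contributes its own ids, so a one-letter misspelling changes exactly one position in the encoder input.

```python
from typing import List

# Assumption: 3 ids are reserved for special tokens (e.g. pad, eos, unk),
# mirroring the public ByT5 tokenizer; the value is illustrative.
RESERVED_IDS = 3


def byte_level_ids(text: str, offset: int = RESERVED_IDS) -> List[int]:
    """Map each UTF-8 byte of `text` to an integer id shifted by `offset`."""
    return [b + offset for b in text.encode("utf-8")]


def decode_ids(ids: List[int], offset: int = RESERVED_IDS) -> str:
    """Invert byte_level_ids: shift ids back to bytes and decode as UTF-8."""
    return bytes(i - offset for i in ids).decode("utf-8")


def differing_positions(a: str, b: str) -> List[int]:
    """Indices at which the byte-level ids of two strings differ.

    Only positions present in both sequences are compared.
    """
    ids_a, ids_b = byte_level_ids(a), byte_level_ids(b)
    return [i for i, (x, y) in enumerate(zip(ids_a, ids_b)) if x != y]


if __name__ == "__main__":
    # Each ASCII character maps to exactly one id.
    print(byte_level_ids("Glyph"))
    # A single wrong letter changes a single position.
    print(differing_positions("Glyph", "Glyqh"))
    # Non-ASCII characters expand to several bytes, hence several ids.
    print(len(byte_level_ids("é")))
```

A subword tokenizer, by contrast, may map a whole word to one id, which hides its spelling from the encoder; that contrast is the motivation for starting from a character-aware encoder.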
