Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Abstract

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine languages other than English, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
