Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Abstract

Recently, Glyph-ByT5 achieved highly accurate visual text rendering in graphic design images. However, it supports only English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance visual aesthetic quality. By combining these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We regard our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
