Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Abstract

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering in graphic design images. However, it focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To this end, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance visual aesthetic quality. Combining these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that supports accurate spelling in 10 different languages. We regard our work as a significant advance, given that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
