Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
March 14, 2024
Authors: Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan
cs.AI
Abstract
Visual text rendering poses a fundamental challenge for contemporary
text-to-image generation models, with the core problem lying in text encoder
deficiencies. To achieve accurate text rendering, we identify two crucial
requirements for text encoders: character awareness and alignment with glyphs.
Our solution involves crafting a series of customized text encoders, Glyph-ByT5,
by fine-tuning the character-aware ByT5 encoder using a meticulously curated
paired glyph-text dataset. We present an effective method for integrating
Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for
design image generation. This significantly enhances text rendering accuracy,
improving it from less than 20% to nearly 90% on our design image
benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph
rendering, achieving high spelling accuracy for tens to hundreds of characters
with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with
a small set of high-quality, photorealistic images featuring visual text, we
showcase a substantial improvement in scene text rendering capabilities in
open-domain real images. These compelling outcomes aim to encourage further
exploration in designing customized text encoders for diverse and challenging
tasks.
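
To illustrate what "character awareness" means for a byte-level encoder such as ByT5, here is a minimal sketch in plain Python. It is not the authors' code and does not use the real tokenizer. The helper names `byte_level_ids` and `ids_to_text` are hypothetical, and the offset of 3 (ids reserved for special tokens) is an assumption about the ByT5 convention. The point is only that each character's bytes stay individually visible to the encoder, whereas a subword tokenizer may merge a whole word into a single id.

```python
def byte_level_ids(text: str, offset: int = 3) -> list[int]:
    """Map text to byte-level ids: each UTF-8 byte value plus an offset.

    The offset (assumed to be 3 here) leaves the lowest ids free for
    special tokens such as padding and end-of-sequence.
    """
    return [b + offset for b in text.encode("utf-8")]


def ids_to_text(ids: list[int], offset: int = 3) -> str:
    """Invert byte_level_ids: subtract the offset and decode the bytes."""
    return bytes(i - offset for i in ids).decode("utf-8")


word = "Glyph"
ids = byte_level_ids(word)

# For ASCII text there is exactly one id per character, so the spelling
# of the word is fully preserved in the input sequence.
for ch, i in zip(word, ids):
    print(f"{ch!r} -> {i}")
```

Because the mapping is invertible byte by byte, the exact spelling is available at the input of the encoder, which is the property the abstract calls character awareness. Alignment with glyphs is a separate requirement that this sketch does not cover.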