EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
May 30, 2025
Authors: Runnan Lu, Yuxuan Zhang, Jailing Liu, Haifa Wang, Yiren Song
cs.AI
Abstract
Generating accurate multilingual text with diffusion models has long been
desired but remains challenging. Recent methods have made progress in rendering
text in a single language, but rendering arbitrary languages is still an
unexplored area. This paper introduces EasyText, a text rendering framework
based on DiT (Diffusion Transformer), which connects denoising latents with
multilingual text encoded as character tokens. We propose character
positioning encoding and position encoding interpolation techniques to achieve
controllable and precise text rendering. Additionally, we construct a
large-scale synthetic text image dataset with 1 million multilingual image-text
annotations as well as a high-quality dataset of 20K annotated images, which
are used for pretraining and fine-tuning respectively. Extensive experiments
and evaluations demonstrate the effectiveness and advancement of our approach
in multilingual text rendering, visual quality, and layout-aware text
integration.