EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Abstract

Generating accurate multilingual text with diffusion models has long been a goal, yet it remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still unexplored. This paper introduces EasyText, a text rendering framework built on DiT (Diffusion Transformer) that connects the denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. We also construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations for pretraining, and a high-quality dataset of 20K annotated images for fine-tuning. Extensive experiments and evaluations demonstrate the effectiveness of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
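The abstract names position encoding interpolation without giving details, so the following is only an illustrative sketch of one plausible reading: the position ids of a small grid of condition tokens are linearly interpolated so that they span a target box in the latent grid. The function name `interpolate_positions`, the `(top, left, bottom, right)` box format, and the grid sizes are assumptions made for illustration and are not taken from the paper.

```python
def interpolate_positions(cond_h, cond_w, box):
    """Assign fractional (row, col) position ids to a cond_h x cond_w grid
    of condition tokens so that the grid spans the target box.

    box is (top, left, bottom, right) in latent-grid coordinates.
    This is a hypothetical helper, not the paper's implementation.
    """
    top, left, bottom, right = box

    def lerp(start, end, i, n):
        # A single token along an axis is placed at the centre of the span.
        if n == 1:
            return (start + end) / 2.0
        # Otherwise spread n tokens evenly from start to end (inclusive).
        return start + (end - start) * i / (n - 1)

    return [
        [(lerp(top, bottom, r, cond_h), lerp(left, right, c, cond_w))
         for c in range(cond_w)]
        for r in range(cond_h)
    ]


# Example: a 2 x 3 condition grid mapped onto rows 10..14, columns 20..28.
positions = interpolate_positions(2, 3, (10, 20, 14, 28))
```

Under this reading, moving or resizing the box changes where the condition tokens are "seen" by the model without changing the tokens themselves, which is one way layout control could be exposed.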
