EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Abstract

Generating accurate multilingual text with diffusion models has long been a goal but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering text in arbitrary languages is still largely unexplored. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer) that connects the denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations and a high-quality dataset of 20K annotated images, used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advanced performance of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
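The abstract names position encoding interpolation without describing how it is computed. As a rough illustration of the general idea only, and not the EasyText implementation, the sketch below spreads the position indices of a short character-token sequence evenly over a target span and evaluates a standard sinusoidal encoding at the resulting (possibly fractional) positions. The function names, the span 8 to 24, and the choice of sinusoidal rather than rotary encoding are all assumptions made for this example.

```python
import math


def interpolate_positions(num_tokens, start, end):
    """Spread num_tokens position indices evenly over [start, end].

    Hypothetical helper: the abstract does not define the actual mapping.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    if num_tokens == 1:
        # A single token is placed at the centre of the span.
        return [(start + end) / 2.0]
    step = (end - start) / (num_tokens - 1)
    return [start + i * step for i in range(num_tokens)]


def sinusoidal_encoding(position, dim, base=10000.0):
    """Standard sinusoidal encoding at a possibly fractional position."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for k in range(dim // 2):
        angle = position / (base ** (2 * k / dim))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


# Assumed example: 5 character tokens mapped onto latent columns 8..24.
positions = interpolate_positions(5, 8.0, 24.0)
encodings = [sinusoidal_encoding(p, dim=8) for p in positions]
```

Because the encoding is a continuous function of position, tokens can be assigned to any region of the target layout without changing the sequence length, which is one plausible reading of how interpolation supports layout control.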
