RepText: Rendering Visual Text via Replicating
April 28, 2025
Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen
cs.AI
Abstract
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand it. Specifically, we adopt the setting from ControlNet and additionally integrate language-agnostic glyph and position information of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed alongside the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. For fairness, we also exhaustively discuss its limitations at the end.
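The abstract names a text perceptual loss applied alongside the diffusion loss but gives no formula. A minimal sketch of what the combined objective could look like, assuming a standard noise-prediction diffusion loss, an OCR-style feature extractor $\phi$, a text-region mask $m$, and a weighting coefficient $\lambda$ (all assumptions on our part, not the paper's stated formulation):

```latex
% Hypothetical combined objective: noise-prediction diffusion loss plus a
% perceptual term comparing text-recognition features of the predicted
% text region against the rendered glyph target. \phi, m, \lambda assumed.
\mathcal{L} =
\underbrace{\mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\mathrm{glyph}}, c_{\mathrm{pos}}) \rVert_2^2\right]}_{\text{diffusion loss}}
\;+\; \lambda\,
\underbrace{\lVert \phi(\hat{x}_0 \odot m) - \phi(x_{\mathrm{glyph}} \odot m) \rVert_2^2}_{\text{text perceptual loss}}
```

Here $\hat{x}_0$ denotes the image predicted from the current noisy latent, $x_{\mathrm{glyph}}$ the rendered glyph target, and $c_{\mathrm{glyph}}, c_{\mathrm{pos}}$ the glyph and position conditions fed through the ControlNet branch.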
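The inference-time procedure described in the abstract, initializing from a noisy glyph latent and masking feature injection to the text region, can be sketched in diffusers-style code. Everything below, including the function names, signatures, and the `resize_mask` helper, is an illustrative assumption on our part, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def resize_mask(mask, size):
    # Nearest-neighbor resize of a binary text-region mask (B,1,H,W)
    # to match a given feature-map resolution.
    return F.interpolate(mask, size=size, mode="nearest")

@torch.no_grad()
def reptext_style_inference(unet, controlnet, vae, scheduler,
                            glyph_image, text_mask, prompt_emb,
                            num_steps=30):
    """Illustrative sketch (assumed names/signatures, diffusers-style).

    Two ideas from the abstract:
      1) initialize the latent from a noised glyph latent rather than
         pure Gaussian noise, and
      2) inject ControlNet features only inside the text region so the
         background is left undistorted.
    """
    # Encode the rendered glyph canvas into latent space.
    glyph_latent = vae.encode(glyph_image).latent_dist.sample()
    glyph_latent = glyph_latent * vae.config.scaling_factor

    # Noise the glyph latent up to the first inference timestep
    # instead of sampling the initial latent from N(0, I).
    scheduler.set_timesteps(num_steps)
    noise = torch.randn_like(glyph_latent)
    t0 = scheduler.timesteps[0]
    latent = scheduler.add_noise(glyph_latent, noise, t0.unsqueeze(0))

    for t in scheduler.timesteps:
        # The ControlNet branch consumes the glyph/position condition.
        down_res, mid_res = controlnet(
            latent, t, encoder_hidden_states=prompt_emb,
            controlnet_cond=glyph_image, return_dict=False)

        # Region mask: keep injected features only inside text areas.
        down_res = [r * resize_mask(text_mask, r.shape[-2:]) for r in down_res]
        mid_res = mid_res * resize_mask(text_mask, mid_res.shape[-2:])

        noise_pred = unet(
            latent, t, encoder_hidden_states=prompt_emb,
            down_block_additional_residuals=down_res,
            mid_block_additional_residual=mid_res).sample
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    return vae.decode(latent / vae.config.scaling_factor).sample
```

Under these assumptions, starting from a noised glyph latent biases early denoising steps toward the correct stroke layout, while the region mask confines the control signal so that areas outside the text follow the base model's prior.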