RepText: Rendering Visual Text via Replicating
April 28, 2025
Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen
cs.AI
Abstract
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand it. Specifically, we adopt the setting from ControlNet and additionally integrate language-agnostic glyph and position information of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed alongside the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. For fairness, we also exhaustively discuss its limitations at the end.
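The abstract names a text perceptual loss applied alongside the diffusion loss but gives no formula. A minimal sketch of what the combined objective could look like, assuming a standard noise-prediction diffusion loss, an OCR-style feature extractor $\phi$, a text-region mask $m$, and a weighting coefficient $\lambda$ (all assumptions on our part, not the paper's stated formulation):

```latex
% Hypothetical combined objective: noise-prediction diffusion loss plus a
% perceptual term comparing text-recognition features of the predicted
% text region against the rendered glyph target. \phi, m, \lambda assumed.
\mathcal{L} =
\underbrace{\mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\mathrm{glyph}}, c_{\mathrm{pos}}) \rVert_2^2\right]}_{\text{diffusion loss}}
\;+\; \lambda\,
\underbrace{\lVert \phi(\hat{x}_0 \odot m) - \phi(x_{\mathrm{glyph}} \odot m) \rVert_2^2}_{\text{text perceptual loss}}
```

Here $\hat{x}_0$ denotes the image predicted from the current noisy latent, $x_{\mathrm{glyph}}$ the rendered glyph target, and $c_{\mathrm{glyph}}, c_{\mathrm{pos}}$ the glyph and position conditions fed through the ControlNet branch.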
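The inference-time procedure described in the abstract, initializing from a noisy glyph latent and masking feature injection to the text region, can be sketched in diffusers-style code. Everything below, including the function names, signatures, and the `resize_mask` helper, is an illustrative assumption on our part, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def resize_mask(mask, size):
    # Nearest-neighbor resize of a binary text-region mask (B,1,H,W)
    # to match a given feature-map resolution.
    return F.interpolate(mask, size=size, mode="nearest")

@torch.no_grad()
def reptext_style_inference(unet, controlnet, vae, scheduler,
                            glyph_image, text_mask, prompt_emb,
                            num_steps=30):
    """Illustrative sketch (assumed names/signatures, diffusers-style).

    Two ideas from the abstract:
      1) initialize the latent from a noised glyph latent rather than
         pure Gaussian noise, and
      2) inject ControlNet features only inside the text region so the
         background is left undistorted.
    """
    # Encode the rendered glyph canvas into latent space.
    glyph_latent = vae.encode(glyph_image).latent_dist.sample()
    glyph_latent = glyph_latent * vae.config.scaling_factor

    # Noise the glyph latent up to the first inference timestep
    # instead of sampling the initial latent from N(0, I).
    scheduler.set_timesteps(num_steps)
    noise = torch.randn_like(glyph_latent)
    t0 = scheduler.timesteps[0]
    latent = scheduler.add_noise(glyph_latent, noise, t0.unsqueeze(0))

    for t in scheduler.timesteps:
        # The ControlNet branch consumes the glyph/position condition.
        down_res, mid_res = controlnet(
            latent, t, encoder_hidden_states=prompt_emb,
            controlnet_cond=glyph_image, return_dict=False)

        # Region mask: keep injected features only inside text areas.
        down_res = [r * resize_mask(text_mask, r.shape[-2:]) for r in down_res]
        mid_res = mid_res * resize_mask(text_mask, mid_res.shape[-2:])

        noise_pred = unet(
            latent, t, encoder_hidden_states=prompt_emb,
            down_block_additional_residuals=down_res,
            mid_block_additional_residual=mid_res).sample
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    return vae.decode(latent / vae.config.scaling_factor).sample
```

Under these assumptions, starting from a noised glyph latent biases early denoising steps toward the correct stroke layout, while the region mask confines the control signal so that areas outside the text follow the base model's prior.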