Visual Text Generation in the Wild

Abstract

With the rapid advancement of generative models, the field of visual text generation has made significant progress. However, rendering high-quality text images in real-world scenarios remains challenging, because three critical criteria must be satisfied: (1) Fidelity: the generated text images should be photo-realistic, and their contents should match what is specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, whether rendering-based or diffusion-based, can hardly meet all three criteria simultaneously, which limits their range of application. In this paper, we therefore propose a visual text generator (termed SceneVTG) that can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels; a conditional diffusion model then uses these recommendations as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. In addition, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
