TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Abstract

This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted or blurred visual text, or omit some of the text entirely. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.