TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Abstract

This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted or blurred visual text, or omit some of the visual text altogether. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.
