TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Abstract

This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted or blurred visual text, or omit some of the visual text entirely. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.
