RepText: Rendering Visual Text via Replicating

Abstract

Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially in non-Latin scripts, remains constrained. To address this limitation, we start from a naive assumption: text understanding is a sufficient condition for text rendering, but not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models to accurately render, or more precisely replicate, multilingual visual text in user-specified fonts without needing to understand it. Specifically, we adopt the ControlNet setting and additionally integrate language-agnostic glyphs and the positions of the rendered text to generate harmonized visual text, allowing users to customize text content, font, and position as needed. To improve accuracy, a text perceptual loss is employed alongside the diffusion loss. Furthermore, to stabilize the rendering process at inference time, we initialize directly from a noisy glyph latent instead of random noise, and we adopt region masks that restrict feature injection to the text region to avoid distorting the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing work. Our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. In the interest of fairness, we also discuss its limitations in detail at the end.
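
The abstract names three mechanisms without giving formulas: a text perceptual loss added to the diffusion loss, initialization from a noisy glyph latent, and region-masked feature injection. The following is a minimal NumPy sketch of how such pieces could fit together, not the authors' implementation. The function names, the linear noise blend, the `noise_level`, `scale`, and `weight` parameters, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np


def init_from_glyph_latent(glyph_latent, noise_level, rng):
    """Start sampling from a noised glyph latent rather than pure noise.

    ASSUMPTION: a simple linear blend. noise_level=1.0 gives pure Gaussian
    noise, noise_level=0.0 returns the clean glyph latent unchanged.
    """
    noise = rng.standard_normal(glyph_latent.shape)
    return (1.0 - noise_level) * glyph_latent + noise_level * noise


def masked_injection(base_features, control_residual, region_mask, scale=1.0):
    """Add control-branch features only inside the text region.

    region_mask is 1 inside the text region and 0 elsewhere, so features
    outside the region are left untouched (background is not distorted).
    """
    return base_features + scale * region_mask * control_residual


def total_loss(diffusion_loss, text_perceptual_loss, weight=0.1):
    """Combine the two training losses.

    ASSUMPTION: a weighted sum; the weight value is hypothetical.
    """
    return diffusion_loss + weight * text_perceptual_loss


# Usage with hypothetical shapes: (batch, channels, height, width).
rng = np.random.default_rng(0)
glyph_latent = np.ones((1, 4, 8, 8))
start_latent = init_from_glyph_latent(glyph_latent, noise_level=0.9, rng=rng)

region_mask = np.zeros((1, 1, 8, 8))
region_mask[..., 2:6, 2:6] = 1.0  # text occupies the central 4x4 patch
base = np.zeros((1, 4, 8, 8))
residual = np.ones((1, 4, 8, 8))
injected = masked_injection(base, residual, region_mask)
```

The mask has a single channel and relies on NumPy broadcasting to apply across all feature channels, so one text-region mask can be reused at every layer that shares the same spatial resolution.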
