EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Abstract

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering text in arbitrary languages is still unexplored. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer) that connects denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. We also construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations and a high-quality dataset of 20K annotated images, used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
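The abstract names position encoding interpolation without describing how it works. One plausible reading, offered only as an illustrative sketch and not as the authors' implementation, is that the position indices of a fixed-size grid of text-condition tokens are linearly rescaled so that they span a chosen target region of the image latent grid. The function names, the (top, left, bottom, right) box format, and the 2-D (row, column) index convention below are all assumptions made for this example.

```python
from typing import List, Tuple


def linspace(start: float, stop: float, n: int) -> List[float]:
    """Return n evenly spaced values from start to stop, inclusive."""
    if n == 1:
        # A single token is placed at the centre of the interval.
        return [(start + stop) / 2.0]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def interpolate_position_ids(
    cond_h: int,
    cond_w: int,
    box: Tuple[float, float, float, float],
) -> List[Tuple[float, float]]:
    """Map a cond_h x cond_w grid of condition tokens onto a target box.

    box is a hypothetical (top, left, bottom, right) region expressed in
    latent-grid coordinates. Each condition token receives a fractional
    (row, col) position inside that region, in row-major order.
    """
    top, left, bottom, right = box
    rows = linspace(top, bottom, cond_h)
    cols = linspace(left, right, cond_w)
    return [(r, c) for r in rows for c in cols]


# A 2 x 4 grid of condition tokens stretched over rows 10..12, columns 20..26.
ids = interpolate_position_ids(2, 4, (10.0, 20.0, 12.0, 26.0))
```

In this sketch, changing the box moves or resizes the region that the condition tokens are aligned with, which is one way layout control could be expressed through position indices alone.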
