GlyphControl: Glyph Conditional Control for Visual Text Generation

Abstract

Interest has been growing in diffusion-based text-to-image generative models that can produce coherent, well-formed visual text. In this paper, we propose GlyphControl, a novel and efficient approach to this task. Unlike existing methods that rely on character-aware text encoders such as ByT5 and require retraining the text-to-image model, our approach uses additional glyph conditional information to improve the off-the-shelf Stable-Diffusion model's ability to generate accurate visual text. By incorporating glyph instructions, users can customize the content, location, and size of the generated text to suit their requirements. To facilitate further research in visual text generation, we construct a training benchmark dataset called LAION-Glyph. We evaluate our approach by measuring OCR-based metrics and CLIP scores of the generated visual text. Our empirical evaluations show that GlyphControl outperforms the recent DeepFloyd IF approach in OCR accuracy and CLIP scores, highlighting the efficacy of our method.
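
The abstract does not specify which OCR-based metrics are used. As a minimal sketch, assume two common choices: case-insensitive exact-match accuracy and mean character-level edit distance between the requested text and the text an OCR system reads back from the generated image. The function names (`normalize`, `edit_distance`, `ocr_scores`) and the normalization rule are illustrative assumptions, not the authors' evaluation code, and the OCR step itself is left out.

```python
from typing import Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_scores(targets: Sequence[str],
               recognized: Sequence[str]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean edit distance) over paired strings.

    `targets` holds the text each image was asked to show; `recognized`
    holds what a hypothetical OCR system read from the generated image.
    """
    if not targets or len(targets) != len(recognized):
        raise ValueError("need two non-empty sequences of equal length")
    pairs = [(normalize(t), normalize(r)) for t, r in zip(targets, recognized)]
    accuracy = sum(t == r for t, r in pairs) / len(pairs)
    mean_distance = sum(edit_distance(t, r) for t, r in pairs) / len(pairs)
    return accuracy, mean_distance
```

Under these assumptions, higher exact-match accuracy and lower mean edit distance both indicate that the rendered text is closer to what was requested.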
