TextDiffuser: Diffusion Models as Text Painters

Abstract

Diffusion models have attracted growing attention for their impressive generation abilities, but they still struggle to render accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt; second, diffusion models generate images conditioned on the text prompt and the generated layout. We also contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable: it can create high-quality text images from text prompts alone or together with text template images, and it can perform text inpainting to reconstruct incomplete images that contain text. The code, model, and dataset are available at https://aka.ms/textdiffuser.
