TextDiffuser: Diffusion Models as Text Painters
May 18, 2023
Authors: Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
cs.AI
Abstract
Diffusion models have gained increasing attention for their impressive
generation abilities but currently struggle with rendering accurate and
coherent text. To address this issue, we introduce TextDiffuser,
focusing on generating images with visually appealing text that is coherent
with backgrounds. TextDiffuser consists of two stages: first, a Transformer
model generates the layout of keywords extracted from text prompts, and then
diffusion models generate images conditioned on the text prompt and the
generated layout. Additionally, we contribute the first large-scale text image
dataset with OCR annotations, MARIO-10M, containing 10 million
image-text pairs with text recognition, detection, and character-level
segmentation annotations. We further collect the MARIO-Eval benchmark
to serve as a comprehensive tool for evaluating text rendering quality. Through
experiments and user studies, we show that TextDiffuser is flexible and
controllable: it creates high-quality text images using text prompts alone or
together with text template images, and it performs text inpainting to reconstruct
incomplete images with text. The code, model, and dataset will be available at
https://aka.ms/textdiffuser.
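
The two-stage flow described in the abstract can be sketched as a toy example. This is not the authors' implementation: every function name here is hypothetical, the rule that keywords are the single-quoted spans of the prompt is an assumption, the layout stage is a fixed stacking rule standing in for the Transformer, and the image stage only records what a diffusion model would be conditioned on.

```python
import re
from dataclasses import dataclass


@dataclass
class KeywordBox:
    """A keyword and its box as (left, top, right, bottom) in pixels."""
    text: str
    box: tuple


def extract_keywords(prompt):
    # Assumption: the text to render is given as single-quoted spans.
    return re.findall(r"'([^']+)'", prompt)


def generate_layout(keywords, width=512, height=512, line_height=64):
    # Stand-in for the stage-one Transformer: stack one box per keyword,
    # centered vertically. A learned model would predict these boxes.
    top = (height - line_height * len(keywords)) // 2
    margin = width // 8
    return [
        KeywordBox(
            word,
            (margin, top + i * line_height,
             width - margin, top + (i + 1) * line_height),
        )
        for i, word in enumerate(keywords)
    ]


def generate_image(prompt, layout):
    # Stand-in for the stage-two diffusion model: no pixels are produced,
    # only the two conditioning inputs named in the abstract are returned.
    return {"prompt": prompt, "layout": layout}


def text_to_image(prompt):
    keywords = extract_keywords(prompt)
    layout = generate_layout(keywords)
    return generate_image(prompt, layout)


result = text_to_image("a poster that says 'Hello' and 'World'")
for item in result["layout"]:
    print(item.text, item.box)
```

The point of the sketch is the separation of concerns: the layout is an explicit intermediate result, which is what would allow a user-supplied template image or an inpainting mask to replace the generated layout in the second stage.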