TextDiffuser: Diffusion Models as Text Painters

Abstract

Diffusion models have gained increasing attention for their impressive generation abilities, but they currently struggle to render accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt; then, diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable: it can create high-quality text images from text prompts alone or together with text template images, and it can perform text inpainting to reconstruct incomplete images containing text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.

