TextDiffuser: Diffusion Models as Text Painters

Abstract

Diffusion models have gained increasing attention for their impressive generation abilities, but they currently struggle to render accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt, and then a diffusion model generates the image conditioned on the text prompt and the generated layout. Additionally, we contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable: it can create high-quality text images from text prompts alone or together with text template images, and it can perform text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
