RepText: Rendering Visual Text via Replicating

Abstract

Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially in non-Latin scripts, remains constrained. To address these limitations, we start from a naive assumption: text understanding is a sufficient condition for text rendering, but not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without needing to actually understand it. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable the generation of harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase we initialize directly from a noisy glyph latent instead of a random initialization, and we adopt region masks to restrict feature injection to the text region only, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works. Our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. For fairness, we also discuss its limitations exhaustively at the end.
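The two inference-time ideas named in the abstract, starting from a noisy glyph latent and restricting feature injection with a region mask, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the additive noise model, the additive form of the injection, and the use of plain nested lists in place of real latent tensors are all assumptions made only for illustration.

```python
import random


def init_latent_from_glyph(glyph_latent, noise_scale, seed=0):
    """Return the glyph latent perturbed by Gaussian noise.

    Assumption: noise is simply added to the glyph latent. The abstract
    only says that inference starts from a "noisy glyph latent" rather
    than from a purely random one.
    """
    rng = random.Random(seed)
    return [
        [g + noise_scale * rng.gauss(0.0, 1.0) for g in row]
        for row in glyph_latent
    ]


def masked_injection(base_features, control_features, region_mask):
    """Add control features to base features only where the mask is 1.

    Assumption: the control branch contributes an additive residual.
    Positions with mask 0 (background) are left untouched.
    """
    return [
        [b + m * c for b, c, m in zip(b_row, c_row, m_row)]
        for b_row, c_row, m_row in zip(
            base_features, control_features, region_mask
        )
    ]


if __name__ == "__main__":
    # Hypothetical 2x2 "latent": 1.0 marks a glyph pixel, 0.0 background.
    glyph = [[1.0, 0.0], [0.0, 0.0]]
    start = init_latent_from_glyph(glyph, noise_scale=0.5)

    base = [[1.0, 1.0], [1.0, 1.0]]
    control = [[0.5, 0.5], [0.5, 0.5]]
    mask = [[1, 0], [0, 0]]  # text occupies only the top-left cell
    print(masked_injection(base, control, mask))
```

In this toy setting only the masked cell receives the control signal, which mirrors the stated goal of leaving the background undistorted.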
