Visual Text Generation in the Wild

Abstract

Recently, with rapid advances in generative models, the field of visual text generation has made significant progress. However, rendering high-quality text images in real-world scenarios remains challenging, because three critical criteria must be satisfied: (1) Fidelity: the generated text images should be photo-realistic, and their content should match what is specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, whether rendering-based or diffusion-based, can hardly meet all of these criteria simultaneously, which limits their range of application. We therefore propose a visual text generator (termed SceneVTG) that can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels; a conditional diffusion model then uses these recommendations as conditions to generate text images. Extensive experiments demonstrate that SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. In addition, the generated images provide superior utility for text detection and text recognition tasks. Code and datasets are available at AdvancedLiterateMachinery.
