Visual Text Generation in the Wild

Abstract

With the rapid advancement of generative models, the field of visual text generation has made significant progress. However, rendering high-quality text images in real-world scenarios remains challenging, because three critical criteria must be satisfied: (1) Fidelity: the generated text images should be photo-realistic, and their contents should match those specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, whether rendering-based or diffusion-based, can hardly meet all three criteria simultaneously, which limits their range of application. We therefore propose a visual text generator (termed SceneVTG) that can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels; a conditional diffusion model then uses these recommendations as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. In addition, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
