Visual Text Generation in the Wild
July 19, 2024
Authors: Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang
cs.AI
Abstract
Recently, with rapid advances in generative models, the field of
visual text generation has witnessed significant progress. However, it is still
challenging to render high-quality text images in real-world scenarios, as
three critical criteria should be satisfied: (1) Fidelity: the generated text
images should be photo-realistic, and their contents should match those
specified in the given conditions; (2) Reasonability: the regions and
contents of the generated text should cohere with the scene; (3) Utility: the
generated text images should facilitate related tasks (e.g., text detection and
recognition). Upon investigation, we find that existing methods, either
rendering-based or diffusion-based, can hardly meet all these aspects
simultaneously, limiting their application range. Therefore, we propose in this
paper a visual text generator (termed SceneVTG), which can produce high-quality
text images in the wild. Following a two-stage paradigm, SceneVTG leverages a
Multimodal Large Language Model to recommend reasonable text regions and
contents across multiple scales and levels, which are used by a conditional
diffusion model as conditions to generate text images. Extensive experiments
demonstrate that the proposed SceneVTG significantly outperforms traditional
rendering-based methods and recent diffusion-based methods in terms of fidelity
and reasonability. In addition, the generated images provide superior utility for
tasks involving text detection and text recognition. Code and datasets are
available at AdvancedLiterateMachinery.
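
The two-stage paradigm described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the SceneVTG implementation: every name here (`TextCondition`, `generate_scene_text`, `stub_recommend`, `stub_render`) is hypothetical, the stubs stand in for the Multimodal Large Language Model and the conditional diffusion model, and the bounds check between the stages is an added assumption rather than something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names in this sketch are hypothetical; none come from the SceneVTG code.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Image:
    width: int
    height: int


@dataclass
class TextCondition:
    box: Box   # where the text should appear
    text: str  # what the text should say


@dataclass
class Generated:
    background: Image
    rendered: List[TextCondition]


# Stage 1: proposes text regions and contents (an MLLM in the paper).
Recommender = Callable[[Image], List[TextCondition]]
# Stage 2: generates the text image from those conditions (a diffusion model in the paper).
Generator = Callable[[Image, List[TextCondition]], Generated]


def inside(box: Box, image: Image) -> bool:
    """Return True if the box is non-empty and lies fully within the image."""
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= image.width and 0 <= y0 < y1 <= image.height


def generate_scene_text(
    background: Image, recommend: Recommender, render: Generator
) -> Generated:
    """Run stage 1, drop out-of-bounds regions (an assumption), then run stage 2."""
    conditions = recommend(background)
    valid = [c for c in conditions if inside(c.box, background)]
    return render(background, valid)


def stub_recommend(image: Image) -> List[TextCondition]:
    # Placeholder for the MLLM: returns fixed proposals, one of them out of bounds.
    return [
        TextCondition((40, 30, 200, 80), "OPEN"),
        TextCondition((600, 400, 700, 500), "EXIT"),
    ]


def stub_render(image: Image, conditions: List[TextCondition]) -> Generated:
    # Placeholder for the diffusion model: only records what it was asked to draw.
    return Generated(background=image, rendered=list(conditions))
```

Passing the two stages in as callables keeps the sketch independent of any particular model; a real recommender or generator with the same signature could be swapped in for the stubs.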