BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
March 26, 2025
Authors: Yuyang Peng, Shishi Xiao, Keming Wu, Qisheng Liao, Bohan Chen, Kevin Lin, Danqing Huang, Ji Li, Yuhui Yuan
cs.AI
Abstract
Recently, state-of-the-art text-to-image generation models, such as Flux and
Ideogram 2.0, have made significant progress in sentence-level visual text
rendering. In this paper, we focus on the more challenging scenarios of
article-level visual text rendering and address a novel task of generating
high-quality business content, including infographics and slides, based on
user-provided article-level descriptive prompts and ultra-dense layouts. The
fundamental challenges are twofold: significantly longer context lengths and
the scarcity of high-quality business content data.
In contrast to most previous works that focus on a limited number of
sub-regions and sentence-level prompts, ensuring precise adherence to
ultra-dense layouts with tens or even hundreds of sub-regions in business
content is far more challenging. We make two key technical contributions: (i)
the construction of a scalable, high-quality business content dataset,
Infographics-650K, equipped with ultra-dense layouts and prompts and built
with a layer-wise retrieval-augmented infographic generation scheme; and (ii)
a layout-guided cross-attention scheme, which injects tens of region-wise
prompts into the latent spaces of a set of cropped regions according to the
ultra-dense layouts and flexibly refines each sub-region during inference
using a layout-conditional CFG.
We demonstrate the strong results of our system compared to previous SOTA
systems such as Flux and SD3 on our BizEval prompt set. Additionally, we
conduct thorough ablation experiments to verify the effectiveness of each
component. We hope our constructed Infographics-650K and BizEval can encourage
the broader community to advance the progress of business content generation.
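The abstract mentions refining each sub-region at inference with a layout-conditional CFG but gives no formula. As a rough illustration only, the sketch below applies standard classifier-free guidance with a guidance scale that varies per layout box. The function names (`box_mask`, `region_wise_cfg`), the box format `(x0, y0, x1, y1)`, and the rule that later boxes override earlier ones on overlap are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def box_mask(height, width, box):
    """Binary mask for a box (x0, y0, x1, y1) given in latent-grid coordinates."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def region_wise_cfg(eps_uncond, eps_cond, boxes, scales, default_scale=1.0):
    """Classifier-free guidance with a per-region guidance scale.

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    boxes, scales: one guidance scale per layout box (hypothetical format).
    Positions outside every box use default_scale; on overlap the later
    box wins (an assumption of this sketch).
    """
    _, h, w = eps_uncond.shape
    scale_map = np.full((h, w), default_scale, dtype=np.float32)
    for box, scale in zip(boxes, scales):
        m = box_mask(h, w, box)
        scale_map = scale_map * (1.0 - m) + scale * m
    # Standard CFG formula, with the scalar scale replaced by a spatial map.
    return eps_uncond + scale_map[None] * (eps_cond - eps_uncond)


if __name__ == "__main__":
    eps_u = np.zeros((1, 4, 4), dtype=np.float32)
    eps_c = np.ones((1, 4, 4), dtype=np.float32)
    out = region_wise_cfg(eps_u, eps_c, boxes=[(0, 0, 2, 2)], scales=[3.0])
    print(out[0])
```

A spatial scale map keeps the computation to a single broadcasted blend, so the cost does not grow with the number of sub-regions beyond building the map itself.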