BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Abstract

Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenario of article-level visual text rendering and address a novel task: generating high-quality business content, including infographics and slides, from user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. Most previous works handle only a limited number of sub-regions and sentence-level prompts; ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, Infographics-650K, equipped with ultra-dense layouts and prompts, built by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross-attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts and flexibly refines each sub-region during inference using a layout-conditional classifier-free guidance (CFG). We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope that Infographics-650K and BizEval will encourage the broader community to advance business content generation.
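To make contribution (ii) more concrete, the following is a minimal NumPy sketch of the general idea, not the BizGen implementation. It assumes a flattened latent grid, axis-aligned boxes in latent-grid coordinates, and a single attention head with no learned projections; the function names (`region_token_mask`, `layout_cross_attention`, `layout_conditional_cfg`) and the per-region guidance scales are hypothetical and chosen only for illustration.

```python
import numpy as np


def region_token_mask(boxes, grid_h, grid_w):
    """Boolean mask of shape (num_regions, grid_h * grid_w).

    boxes: list of (x0, y0, x1, y1) in latent-grid coordinates, half-open.
    """
    mask = np.zeros((len(boxes), grid_h, grid_w), dtype=bool)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        mask[i, y0:y1, x0:x1] = True
    return mask.reshape(len(boxes), -1)


def layout_cross_attention(latents, region_prompts, boxes, grid_h, grid_w):
    """Let each cropped region attend only to its own region-wise prompt.

    latents: (grid_h * grid_w, d) array of latent tokens.
    region_prompts: list of (n_i, d) arrays, one per box (assumed pre-encoded).
    Tokens outside every box are returned unchanged.
    """
    d = latents.shape[1]
    out = latents.copy()
    masks = region_token_mask(boxes, grid_h, grid_w)
    for mask, prompt in zip(masks, region_prompts):
        q = latents[mask]                      # cropped region latents, (m, d)
        if q.shape[0] == 0:
            continue
        scores = q @ prompt.T / np.sqrt(d)     # (m, n_i)
        scores -= scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        # Residual update; if boxes overlap, the later box overwrites the earlier one.
        out[mask] = q + weights @ prompt
    return out


def layout_conditional_cfg(eps_uncond, eps_cond, masks, scales, default_scale=1.0):
    """Classifier-free guidance with a guidance scale chosen per layout region.

    eps_uncond, eps_cond: (tokens, d) noise predictions.
    masks: (num_regions, tokens) boolean array; scales: one float per region.
    """
    scale_map = np.full(eps_cond.shape[0], default_scale, dtype=float)
    for mask, scale in zip(masks, scales):
        scale_map[mask] = scale
    return eps_uncond + scale_map[:, None] * (eps_cond - eps_uncond)
```

In this sketch the layout enters twice: once to restrict which prompt tokens each latent token can attend to, and once to build a spatial guidance-scale map so that individual sub-regions can be pushed harder or softer toward their prompts at inference time.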

