BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Abstract

Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenario of article-level visual text rendering and address a novel task: generating high-quality business content, including infographics and slides, from user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. Most previous works focus on a limited number of sub-regions and sentence-level prompts; ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, Infographics-650K, equipped with ultra-dense layouts and prompts, built by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross-attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts and flexibly refines each sub-region during inference using layout-conditional classifier-free guidance (CFG). We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope that Infographics-650K and BizEval encourage the broader community to advance business content generation.
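The second contribution can be pictured with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the `Region` type, the function names, and the idea that "layout-conditional" means a per-region guidance scale are all assumptions made for illustration. The sketch shows (a) a mask that lets each latent cell attend only to the prompt of the region containing it, and (b) the standard CFG update, `uncond + s * (cond - uncond)`, applied with a scale `s` that varies by region.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical box on the latent grid: rows [top, bottom), cols [left, right)."""
    top: int
    left: int
    bottom: int
    right: int
    guidance_scale: float  # assumed per-region CFG strength


def region_attention_mask(height: int, width: int,
                          regions: List[Region]) -> List[List[bool]]:
    """mask[cell][k] is True when latent cell `cell` (row-major order) lies
    inside region k, i.e. it may attend to that region's prompt tokens."""
    return [
        [r.top <= y < r.bottom and r.left <= x < r.right for r in regions]
        for y in range(height)
        for x in range(width)
    ]


def scale_map(height: int, width: int, regions: List[Region],
              default_scale: float) -> List[List[float]]:
    """Per-cell guidance scale. Cells outside every region keep
    `default_scale`; later regions overwrite earlier ones on overlap."""
    grid = [[default_scale] * width for _ in range(height)]
    for r in regions:
        for y in range(max(r.top, 0), min(r.bottom, height)):
            for x in range(max(r.left, 0), min(r.right, width)):
                grid[y][x] = r.guidance_scale
    return grid


def layout_conditional_cfg(uncond: List[List[float]],
                           cond: List[List[float]],
                           scales: List[List[float]]) -> List[List[float]]:
    """Cell-wise CFG update with a spatially varying scale."""
    return [
        [u + s * (c - u) for u, c, s in zip(u_row, c_row, s_row)]
        for u_row, c_row, s_row in zip(uncond, cond, scales)
    ]


# Toy usage: a 2x2 latent grid whose top row is one text region.
regions = [Region(top=0, left=0, bottom=1, right=2, guidance_scale=3.0)]
mask = region_attention_mask(2, 2, regions)
scales = scale_map(2, 2, regions, default_scale=1.0)
uncond = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for the unconditional prediction
cond = [[1.0, 1.0], [1.0, 1.0]]    # stand-in for the prompt-conditioned prediction
guided = layout_conditional_cfg(uncond, cond, scales)
```

In a real diffusion model the predictions would be multi-channel tensors and the mask would be passed to the attention layers; the scalar grid here only makes the region-wise behaviour easy to inspect.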

