PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Abstract

Layout generation is the keystone of automated graphic design: it requires arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack the flexibility to handle varying design requirements. Our research introduces a unified framework for automated graphic layout generation that leverages a multi-modal large language model (MLLM) to accommodate diverse design tasks. In contrast to these earlier approaches, our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing the limitations of existing datasets in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated posters), further validating our model's utility in real-life settings. Marked by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. The code and datasets will be publicly available at https://github.com/posterllava/PosterLLaVA.
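
To make the idea of expressing a layout as structured JSON text more concrete, the following is a minimal Python sketch. It is not the authors' actual schema or code: the `Element` fields, the `constraint` key, the normalized coordinate convention, and all function names are assumptions chosen purely for illustration. It shows how a set of element boxes and a natural-language constraint could be serialized to text, parsed back, and checked against a simple canvas-bounds rule.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Element:
    # Hypothetical fields; the paper's real JSON schema may differ.
    category: str  # e.g. "title", "logo", "text"
    x: float       # left edge, normalized to [0, 1]
    y: float       # top edge, normalized to [0, 1]
    w: float       # width, normalized to [0, 1]
    h: float       # height, normalized to [0, 1]


def layout_to_json(elements, constraint=""):
    """Serialize a layout plus an optional natural-language constraint to JSON text."""
    payload = {
        "constraint": constraint,
        "elements": [asdict(e) for e in elements],
    }
    return json.dumps(payload, ensure_ascii=False)


def layout_from_json(text):
    """Parse JSON text (for example, a model's output) back into Element objects."""
    data = json.loads(text)
    return [Element(**item) for item in data["elements"]]


def is_valid(elements):
    """Return True if every box has positive size and lies inside the unit canvas."""
    return all(
        e.x >= 0.0 and e.y >= 0.0 and e.w > 0.0 and e.h > 0.0
        and e.x + e.w <= 1.0 and e.y + e.h <= 1.0
        for e in elements
    )
```

A text representation like this is what would let a language model read and emit layouts directly, and a round trip through `layout_to_json` and `layout_from_json` gives a cheap way to reject malformed outputs before rendering.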
