LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos from natural language prompts alone. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, which causes the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide generation; a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers; and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, this unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on the text-to-layers and image-to-layers tasks using the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
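The abstract does not state how the generated RGBA layers relate to the full media design. As a minimal illustration, assuming the layers are stacked bottom to top and combined with the standard Porter-Duff "over" operator on straight (non-premultiplied) alpha, flattening a layer stack could be sketched as follows. The names `over` and `flatten` are hypothetical and are not taken from LaDe.

```python
from typing import List, Tuple

# Straight (non-premultiplied) RGBA, each channel in [0, 1]. This is an assumption.
Pixel = Tuple[float, float, float, float]
Layer = List[List[Pixel]]  # rows of pixels; all layers share one size


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over': place `top` on `bottom`."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels fully transparent: colour is undefined, return clear.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers: List[Layer]) -> Layer:
    """Composite layers (given bottom to top) into a single RGBA image."""
    height, width = len(layers[0]), len(layers[0][0])
    canvas: Layer = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for layer in layers:
        canvas = [
            [over(layer[y][x], canvas[y][x]) for x in range(width)]
            for y in range(height)
        ]
    return canvas
```

Under these assumptions, decomposition is the inverse problem: recovering a layer stack whose flattened result reproduces the input design.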
