LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
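To make concrete what it means for a full design to have "constituent RGBA layers", the sketch below recombines a stack of RGBA layers into a single image with the standard Porter-Duff "over" operator. This is an illustrative assumption: the abstract does not state how LaDe composites its layers, and the function name `composite_layers`, the back-to-front ordering, and the float-in-[0, 1] convention are hypothetical choices rather than part of LaDe.

```python
import numpy as np


def composite_layers(layers):
    """Alpha-composite RGBA layers with the Porter-Duff 'over' operator.

    Hypothetical helper, not LaDe's implementation.
    layers: list of float arrays of shape (H, W, 4) with values in [0, 1],
            non-premultiplied alpha, ordered background first.
    Returns one (H, W, 4) float array with non-premultiplied alpha.
    """
    if not layers:
        raise ValueError("at least one layer is required")

    height, width, _ = layers[0].shape
    acc_rgb = np.zeros((height, width, 3))  # premultiplied colour so far
    acc_alpha = np.zeros((height, width, 1))  # accumulated coverage so far

    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # The new layer sits on top: it contributes alpha * rgb and lets
        # (1 - alpha) of everything below show through.
        acc_rgb = rgb * alpha + acc_rgb * (1.0 - alpha)
        acc_alpha = alpha + acc_alpha * (1.0 - alpha)

    # Convert back to non-premultiplied colour, avoiding division by zero
    # where the result is fully transparent.
    safe_alpha = np.where(acc_alpha > 0, acc_alpha, 1.0)
    return np.concatenate([acc_rgb / safe_alpha, acc_alpha], axis=-1)


if __name__ == "__main__":
    background = np.zeros((2, 2, 4))
    background[..., 0] = 1.0  # opaque red
    background[..., 3] = 1.0

    overlay = np.zeros((2, 2, 4))
    overlay[..., 2] = 1.0  # half-transparent blue
    overlay[..., 3] = 0.5

    print(composite_layers([background, overlay])[0, 0])
```

Because each layer keeps its own alpha channel, any single layer can be edited or removed and the design recomposited without touching the others, which is the editability property the abstract emphasizes.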
