LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
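To make the relationship between the full media design and its constituent RGBA layers concrete, the following is a minimal sketch of how a stack of RGBA layers can be flattened into a single image. The abstract does not state how LaDe recomposes its layers, so this assumes the standard Porter-Duff "over" operator with straight (non-premultiplied) alpha and layers ordered back to front. The names `over` and `composite` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: pixels are straight-alpha RGBA tuples with channels in [0, 1].
Pixel = Tuple[float, float, float, float]
Layer = List[List[Pixel]]  # rows of pixels; all layers share one canvas size


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over': place `top` on `bottom` (straight alpha)."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent; colour is undefined, use zeros.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers: List[Layer]) -> Layer:
    """Flatten layers given in back-to-front order onto a transparent canvas."""
    if not layers:
        raise ValueError("at least one layer is required")
    height, width = len(layers[0]), len(layers[0][0])
    canvas: Layer = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                canvas[y][x] = over(layer[y][x], canvas[y][x])
    return canvas


# Toy 1x2 design: an opaque white background and a foreground layer that is
# half-transparent red at the first pixel and fully transparent at the second.
background: Layer = [[(1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0)]]
foreground: Layer = [[(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]
flattened = composite([background, foreground])
```

Because every layer keeps its own alpha channel, an individual layer can be edited or removed and the design re-flattened without touching the others, which is the kind of editability the layered representation is meant to provide.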