MRT: Masked Region Transformer for Large-Scale Layered Image Generation and Editing

Abstract

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, it remains underexplored at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, producing complete, editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, according to user-study results, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality, while running image-to-layers inference 10-100× faster and reducing activation GPU memory consumption by 50-90%.
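The abstract does not spell out how selective token masking maps onto the three tasks, so the following is only an illustrative sketch under stated assumptions, not MRT's actual implementation. It assumes a token sequence made of one input-image region followed by one region per layer, and that a masked region is one the model must generate while an unmasked region serves as clean conditioning. The names `build_region_mask` and `expand_to_tokens`, the region layout, and the per-task choices are all hypothetical.

```python
from typing import Dict, List, Optional


def build_region_mask(
    task: str,
    num_layers: int,
    edit_layers: Optional[List[int]] = None,
) -> Dict[str, bool]:
    """Return {region_name: True if the region is masked, i.e. to be generated}.

    Assumed (hypothetical) layout: one "image" region, then "layer_0" .. "layer_{N-1}".
    """
    layer_names = [f"layer_{i}" for i in range(num_layers)]
    regions = ["image"] + layer_names

    if task == "text-to-layers":
        # Assumption: only the text prompt is given, so every visual region is generated.
        masked = set(regions)
    elif task == "image-to-layers":
        # Assumption: the input image is visible conditioning; all layers are generated.
        masked = set(layer_names)
    elif task == "layers-to-layers":
        # Assumption: the chosen layers are regenerated, the remaining layers are kept,
        # and no composite image is supplied, so the "image" region stays masked.
        if not edit_layers:
            raise ValueError("layers-to-layers needs at least one layer index to edit")
        for i in edit_layers:
            if not 0 <= i < num_layers:
                raise ValueError(f"layer index {i} out of range")
        masked = {"image"} | {f"layer_{i}" for i in edit_layers}
    else:
        raise ValueError(f"unknown task: {task}")

    # Preserve the region order of the assumed sequence layout.
    return {name: name in masked for name in regions}


def expand_to_tokens(region_mask: Dict[str, bool], tokens_per_region: int) -> List[bool]:
    """Repeat each region's flag for every token in that region (fixed size assumed)."""
    flags: List[bool] = []
    for is_masked in region_mask.values():
        flags.extend([is_masked] * tokens_per_region)
    return flags
```

Under these assumptions, a single denoising model could serve all three tasks by changing only which regions are masked: for example, `build_region_mask("layers-to-layers", 3, edit_layers=[1])` marks `layer_1` for regeneration while `layer_0` and `layer_2` remain as conditioning.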