MRT: Masked Region Transformer for Large-Scale Layered Image Generation and Editing

Abstract

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, it remains underexplored at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, yielding complete editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, according to user-study results, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality, while achieving 10 to 100 times faster inference and reducing activation GPU memory consumption by 50% to 90% during image-to-layers inference.
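
To make the idea of selective token masking concrete, the following is a minimal sketch of how a single mask over token regions could express the three tasks. It is an illustration under stated assumptions, not the MRT implementation: the `Region` class, the region names (`composite`, `layer_0`, ...), the function `build_generation_mask`, and the rule for which regions are generated in each task are all hypothetical, since the abstract does not specify the token layout.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Region:
    """Hypothetical token region: the composite image or one layer."""
    name: str        # assumed naming: "composite", "layer_0", "layer_1", ...
    num_tokens: int  # number of latent tokens belonging to this region


def build_generation_mask(
    regions: List[Region],
    task: str,
    edit_layers: Iterable[str] = (),
) -> List[bool]:
    """Return one flag per token.

    True  = token is masked, i.e. noised and generated by the model.
    False = token is kept clean and used as conditioning.
    The per-task rules below are assumptions made for illustration.
    """
    edit_set = set(edit_layers)
    mask: List[bool] = []
    for region in regions:
        is_layer = region.name.startswith("layer_")
        if task == "text-to-layers":
            # Only text is given, so every visual region is generated.
            generate = True
        elif task == "image-to-layers":
            # The composite image is given; all layers are generated.
            generate = is_layer
        elif task == "layers-to-layers":
            # Selected layers are regenerated, the other layers are kept;
            # the composite is assumed to be regenerated as well.
            generate = (region.name in edit_set) or not is_layer
        else:
            raise ValueError(f"unknown task: {task}")
        mask.extend([generate] * region.num_tokens)
    return mask


# Usage: one composite region (2 tokens) and two layers (1 token each).
regions = [Region("composite", 2), Region("layer_0", 1), Region("layer_1", 1)]
t2l = build_generation_mask(regions, "text-to-layers")
i2l = build_generation_mask(regions, "image-to-layers")
l2l = build_generation_mask(regions, "layers-to-layers", edit_layers=["layer_1"])
```

In this sketch the three tasks differ only in which regions are flagged for generation, which is one way a shared model could serve all of them without task-specific architectures.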