MRT: Masked Region Transformer for Large-Scale Layered Image Generation and Editing

Abstract

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, it remains underexplored at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, in which selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, yielding complete editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layers inference.
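
To make the idea of unifying the three tasks through selective token masking more concrete, the following is a minimal sketch and not the MRT implementation. It assumes a hypothetical token layout of one composite-image region followed by one region per layer, and it assumes that the only thing that changes between tasks is which regions are noised (to be generated) and which are kept clean (used as conditioning). The names `Task`, `build_region_mask`, `tokens_per_region`, and `edit_layers` are illustrative and do not come from the paper.

```python
from enum import Enum


class Task(Enum):
    TEXT_TO_LAYERS = "text-to-layers"
    IMAGE_TO_LAYERS = "image-to-layers"
    LAYERS_TO_LAYERS = "layers-to-layers"


def build_region_mask(task, num_layers, tokens_per_region, edit_layers=()):
    """Return a per-token mask over [composite | layer 0 | ... | layer N-1].

    True  = token is noised and must be generated by the model.
    False = token is kept clean and acts as conditioning.

    Assumed (hypothetical) layout: every region has the same token count.
    """
    edit = set(edit_layers)
    if task is Task.LAYERS_TO_LAYERS and not edit:
        raise ValueError("layers-to-layers needs at least one layer to edit")

    # The composite image is a given input only for image-to-layers;
    # in the other two tasks it is treated as something to generate.
    region_flags = [task is not Task.IMAGE_TO_LAYERS]

    for i in range(num_layers):
        if task is Task.LAYERS_TO_LAYERS:
            # Only the layers selected for editing are regenerated.
            region_flags.append(i in edit)
        else:
            # Text-to-layers and image-to-layers generate every layer.
            region_flags.append(True)

    # Expand one flag per region into one flag per token.
    return [flag for flag in region_flags for _ in range(tokens_per_region)]
```

Under these assumptions, `build_region_mask(Task.IMAGE_TO_LAYERS, num_layers=2, tokens_per_region=3)` leaves the three composite tokens clean and marks the six layer tokens for generation, while the layers-to-layers case regenerates only the layers listed in `edit_layers`. A single denoising model can then serve all three tasks, because the task is encoded entirely in the mask.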