MRT: Masked Region Transformer for Large-Scale Layered Image Generation and Editing

Abstract

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, it remains underexplored at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, producing complete, editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layers inference.
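To make the idea of selective masking concrete, the toy sketch below shows one way a single mask pattern could distinguish the three tasks. It is not the authors' implementation: the abstract does not describe the token layout, so the slot arrangement (slot 0 for the flattened composite, slots 1..n for the layers), the function name `build_slot_mask`, and the `edit_layers` parameter are all hypothetical, and the sketch works at the level of whole slots rather than individual tokens.

```python
from typing import Iterable, List

TASKS = ("text-to-layers", "image-to-layers", "layers-to-layers")


def build_slot_mask(task: str, num_layers: int,
                    edit_layers: Iterable[int] = ()) -> List[bool]:
    """Return one flag per slot: [composite, layer_0, ..., layer_{n-1}].

    True means the slot is masked (to be generated); False means the slot
    is supplied as conditioning. The slot layout is an assumption made for
    illustration only.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    edit = set(edit_layers)
    if any(i < 0 or i >= num_layers for i in edit):
        raise ValueError("edit_layers index out of range")

    if task == "text-to-layers":
        # Only the text prompt is given: every visual slot is generated.
        return [True] * (num_layers + 1)
    if task == "image-to-layers":
        # The flattened image is given: only the layers are generated.
        return [False] + [True] * num_layers
    # layers-to-layers: regenerate the chosen layers, keep the rest fixed.
    # Assumption: the composite is regenerated too, since it changes.
    if not edit:
        raise ValueError("layers-to-layers needs at least one layer to edit")
    return [True] + [i in edit for i in range(num_layers)]
```

Under these assumptions, `build_slot_mask("layers-to-layers", 3, edit_layers=[1])` marks only the composite and the second layer for regeneration, while the other two layers stay fixed as conditioning.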