DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Abstract

Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, which leads to high latency and makes deployment harder. On-device diffusion models improve efficiency, but they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B parameters) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and a (target | source) configuration for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite scores 0.72 on GenEval for image generation and 4.11 on ImgEdit for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1 s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
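
The horizontal concatenation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`blank_like`, `concat_width`, `build_unified_input`), the nested-list latent layout of [channels][height][width], and the choice of an all-zero latent for the "blank" slot are assumptions made purely for illustration.

```python
from typing import List, Optional

# Assumed latent layout: [channels][height][width], stored as nested lists.
Latent = List[List[List[float]]]


def blank_like(latent: Latent) -> Latent:
    """Return an all-zero latent with the same shape (assumed 'blank' placeholder)."""
    return [[[0.0 for _ in row] for row in channel] for channel in latent]


def concat_width(left: Latent, right: Latent) -> Latent:
    """Concatenate two latents side by side along the width axis."""
    same_channels = len(left) == len(right)
    same_heights = all(len(lc) == len(rc) for lc, rc in zip(left, right))
    if not (same_channels and same_heights):
        raise ValueError("latents must share channel count and height")
    return [
        [lrow + rrow for lrow, rrow in zip(lc, rc)]
        for lc, rc in zip(left, right)
    ]


def build_unified_input(target: Latent, source: Optional[Latent] = None) -> Latent:
    """Build (target | blank) for generation or (target | source) for editing."""
    context = source if source is not None else blank_like(target)
    return concat_width(target, context)


# Toy 1-channel, 2x2 latents (hypothetical values).
target = [[[1.0, 2.0], [3.0, 4.0]]]
source = [[[5.0, 6.0], [7.0, 8.0]]]

generation_input = build_unified_input(target)        # (target | blank)
editing_input = build_unified_input(target, source)   # (target | source)
```

Under these assumptions both tasks produce an input of the same shape (width doubled), which is what would let a single network handle generation and editing without task-specific input layers.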
