DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
March 30, 2026
作者: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
cs.AI
Abstract
Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models typically contain billions of parameters, leading to high latency and deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact 0.39B-parameter unified on-device diffusion model that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space: images are concatenated horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I generation, image editing, and the joint task. After high-quality supervised fine-tuning (SFT) and reinforcement learning, DreamLite achieves a GenEval score of 0.72 for image generation and an ImgEdit score of 4.11 for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce the denoising process to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in under 1 s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
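The in-context spatial concatenation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, latent shapes, and the choice of a zero latent for the blank condition slot are all assumptions for demonstration.

```python
import numpy as np

def build_unified_input(target_latent, source_latent=None):
    """Horizontally concatenate latents along the width axis.

    Generation: (target | blank)  -- the condition slot is a blank (zero) latent.
    Editing:    (target | source) -- the condition slot is the source-image latent.
    Latents are assumed to have shape (C, H, W); names and shapes are illustrative.
    """
    c, h, w = target_latent.shape
    if source_latent is None:
        cond = np.zeros((c, h, w), dtype=target_latent.dtype)  # blank slot for T2I
    else:
        cond = source_latent  # source-image slot for editing
    return np.concatenate([target_latent, cond], axis=-1)  # shape (C, H, 2W)

# Toy 4-channel, 8x8 latents (a real VAE latent for a 1024x1024 image is larger).
target = np.random.randn(4, 8, 8).astype(np.float32)
source = np.random.randn(4, 8, 8).astype(np.float32)

gen_input = build_unified_input(target)           # (target | blank)
edit_input = build_unified_input(target, source)  # (target | source)
print(gen_input.shape, edit_input.shape)  # (4, 8, 16) (4, 8, 16)
```

Both tasks thus share one input layout, so a single denoising network can switch between generation and editing purely by what occupies the right half of the concatenated latent.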