DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Abstract

Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, which leads to high latency and makes deployment harder. On-device diffusion models improve efficiency, but they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B parameters) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and a (target | source) configuration for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality supervised fine-tuning (SFT) and reinforcement learning, DreamLite achieves a GenEval score of 0.72 for image generation and an ImgEdit score of 4.11 for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1 s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
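
The in-context spatial concatenation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nested-list latent layout ([channel][row][column]), and the choice of an all-zero latent for the "blank" slot are all assumptions made for illustration. The abstract only states that images are concatenated horizontally in the latent space as (target | blank) for generation and (target | source) for editing.

```python
from typing import List, Optional

# Hypothetical latent layout: latent[channel][row][column].
Latent = List[List[List[float]]]


def build_in_context_input(target: Latent, source: Optional[Latent] = None) -> Latent:
    """Concatenate the target latent with a condition latent along the width.

    source=None models T2I generation, giving (target | blank).
    A source latent models editing, giving (target | source).
    """
    if source is None:
        # Assumption: the "blank" slot is an all-zero latent of the same size.
        source = [[[0.0] * len(row) for row in channel] for channel in target]

    same_shape = len(source) == len(target) and all(
        len(s_ch) == len(t_ch)
        and all(len(s_row) == len(t_row) for s_row, t_row in zip(s_ch, t_ch))
        for s_ch, t_ch in zip(source, target)
    )
    if not same_shape:
        raise ValueError("source and target latents must have the same shape")

    # Horizontal concatenation: each row of the target is extended by the
    # matching row of the condition latent, doubling the width.
    return [
        [t_row + s_row for t_row, s_row in zip(t_ch, s_ch)]
        for t_ch, s_ch in zip(target, source)
    ]


def take_target_half(latent: Latent) -> Latent:
    """Keep only the left (target) half of a concatenated latent."""
    return [[row[: len(row) // 2] for row in channel] for channel in latent]


# Toy 1-channel, 2x2 latents, only to show the two configurations.
target = [[[1.0, 2.0], [3.0, 4.0]]]
source = [[[5.0, 6.0], [7.0, 8.0]]]

generation_input = build_in_context_input(target)       # (target | blank)
editing_input = build_in_context_input(target, source)  # (target | source)
```

Under these assumptions, both tasks present the network with an input of the same shape. This is one way a single backbone could serve generation and editing without task-specific input branches.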
