DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

February 12, 2026
Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang
cs.AI

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), incurring prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model whose capabilities are competitive with, or surpass, those of much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations; (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities; and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals to deliver substantial gains in generation quality and alignment with human preferences, while maintaining stable training and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, model weights, and datasets, we provide an efficient, high-performance alternative that democratizes unified multimodal research.
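
To make the SCB idea concrete, below is a minimal PyTorch sketch of a stacked-channel bridge, assuming a design in which hidden states from several VLM layers are channel-concatenated, projected into the DiT width, and prefixed with learnable think tokens. The layer indices, dimensions, and names (`StackedChannelBridge`, `num_think_tokens`, etc.) are illustrative assumptions, not the released DeepGen 1.0 implementation.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Hypothetical sketch of SCB: fuse multi-layer VLM features + think tokens."""

    def __init__(self, vlm_dim: int, dit_dim: int,
                 layer_ids=(8, 16, 24, 32), num_think_tokens: int = 64):
        super().__init__()
        self.layer_ids = layer_ids
        # Learnable "think tokens" prepended to the fused conditioning sequence.
        self.think_tokens = nn.Parameter(
            torch.randn(1, num_think_tokens, dit_dim) * 0.02)
        # Project the channel-stacked multi-layer VLM features into the DiT width.
        self.proj = nn.Linear(vlm_dim * len(layer_ids), dit_dim)
        self.norm = nn.LayerNorm(dit_dim)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: sequence of per-layer tensors, each (B, T, vlm_dim),
        # e.g. the hidden_states tuple a HuggingFace VLM returns when called
        # with output_hidden_states=True.
        feats = [vlm_hidden_states[i] for i in self.layer_ids]
        stacked = torch.cat(feats, dim=-1)           # (B, T, vlm_dim * L)
        bridged = self.norm(self.proj(stacked))      # (B, T, dit_dim)
        think = self.think_tokens.expand(bridged.size(0), -1, -1)
        # Conditioning sequence handed to the diffusion Transformer (DiT).
        return torch.cat([think, bridged], dim=1)    # (B, N_think + T, dit_dim)
```

Channel-wise stacking keeps layer-specific information that a plain average over layers would wash out, which is one plausible reading of "stacked channel" in the name; the released code may fuse layers differently.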
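The abstract describes MR-GRPO only as reinforcement learning driven by a mixture of reward functions and supervision signals. The hedged sketch below reads the acronym as a mixed-reward variant of group-relative policy optimization (GRPO); the particular reward functions, weights, and helper names are assumptions for illustration, not the paper's actual recipe.

```python
import torch

def mixed_reward(images, prompts, reward_fns, weights):
    # reward_fns: callables that each score a batch and return a (B,) tensor,
    # e.g. an aesthetic scorer, a text-image alignment scorer, or an
    # edit-fidelity scorer. `weights` balances the reward mixture.
    parts = [w * fn(images, prompts) for fn, w in zip(reward_fns, weights)]
    return torch.stack(parts, dim=0).sum(dim=0)      # (B,) mixed reward

def group_relative_advantages(rewards: torch.Tensor, group_size: int):
    # GRPO-style advantages: normalize each sample's reward against the other
    # rollouts generated for the same prompt (its "group").
    r = rewards.view(-1, group_size)                 # (num_prompts, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-6)
    return adv.view(-1)                              # (B,) per-sample advantages
```

Group-relative normalization removes the need for a learned value baseline, which is consistent with the abstract's emphasis on stable training under heterogeneous reward signals.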