Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

August 5, 2025
Authors: Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI

Abstract

We introduce Skywork UniPic, a 1.5-billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, with no task-specific adapters or inter-module connectors. We demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). The approach combines three elements: (1) a decoupled encoding strategy that uses a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule that scales from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100-million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
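
The progressive, resolution-aware schedule in item (2) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' training code. The abstract gives only the 256 x 256 and 1024 x 1024 endpoints; the intermediate 512 stage, the number of stages, the module-group names, and the order in which groups are unfrozen are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    resolution: int       # side length of the square training images, in pixels
    trainable: frozenset  # module groups that receive gradient updates


# Hypothetical module groups, named after the components the abstract mentions.
ALL_GROUPS = frozenset({"mar_encoder", "siglip2_encoder", "shared_decoder"})

# Hypothetical schedule: only the 256 and 1024 endpoints come from the abstract;
# the 512 stage and the unfreezing order are assumptions for illustration.
SCHEDULE = (
    Stage(256, frozenset({"mar_encoder"})),
    Stage(512, frozenset({"mar_encoder", "shared_decoder"})),
    Stage(1024, ALL_GROUPS),
)


def frozen_groups(stage: Stage) -> frozenset:
    """Return the module groups that stay frozen during the given stage."""
    return ALL_GROUPS - stage.trainable


def validate(schedule) -> None:
    """Check that resolution never decreases and trainable sets only grow."""
    for prev, cur in zip(schedule, schedule[1:]):
        if cur.resolution < prev.resolution:
            raise ValueError("resolution must not decrease between stages")
        if not prev.trainable <= cur.trainable:
            raise ValueError("a group was re-frozen after being unfrozen")


if __name__ == "__main__":
    validate(SCHEDULE)
    for stage in SCHEDULE:
        print(stage.resolution, sorted(stage.trainable), sorted(frozen_groups(stage)))
```

In a real training loop, each stage would set the gradient flag on the parameters of the listed groups before training at that resolution. Starting with few trainable groups at low resolution is one plausible reading of "balance capacity and stability"; the paper itself should be consulted for the actual stages.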