Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
August 5, 2025
Authors: Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model
that unifies image understanding, text-to-image generation, and image editing
within a single architecture, eliminating the need for task-specific adapters or
inter-module connectors, and demonstrate that compact multimodal systems can
achieve state-of-the-art performance on commodity hardware. Skywork UniPic
achieves a GenEval score of 0.86, surpassing most existing unified models; sets
a new DPG-Bench complex-generation record of 85.5; attains 5.83 on
GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x
1024 images with under 15 GB of GPU memory (e.g., RTX 4090). Its core
techniques are: (1) a decoupled encoding strategy that leverages a masked
autoregressive encoder for synthesis and a SigLIP2 encoder for understanding,
both feeding a shared autoregressive
decoder; (2) a progressive, resolution-aware training schedule scaling from 256
x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance
capacity and stability; and (3) meticulously curated, 100 million-scale
datasets augmented with task-specific reward models to refine generation and
editing objectives. By demonstrating that high-fidelity multimodal integration
need not incur prohibitive resource demands, Skywork UniPic establishes a
practical paradigm for deployable, high-fidelity multimodal AI. Code and
weights are publicly available at
https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
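
To put the memory claim in context, the following is a minimal back-of-the-envelope sketch of how much memory the weights alone of a 1.5 billion-parameter model would occupy. The parameter count comes from the abstract; the numeric precisions and the helper name `weight_memory_gib` are illustrative assumptions, since the abstract does not state which dtype is used at inference time.

```python
# Rough weight-memory estimate for a 1.5B-parameter model.
# Assumption: the precisions below are illustrative only; the source does
# not say which dtype the released weights or inference code use.
NUM_PARAMS = 1.5e9  # parameter count stated in the abstract

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
    # Prints:
    # float32: 5.59 GiB
    # bfloat16: 2.79 GiB
```

Under these assumptions the weights take roughly 2.8 to 5.6 GiB, so most of the reported sub-15 GB budget would go to activations, caches, and image latents at 1024 x 1024 rather than to the parameters themselves. This is an estimate, not a measurement from the paper.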