Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
November 27, 2025
作者: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
cs.AI
Abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0, and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle, from a curated data infrastructure to a streamlined training curriculum, we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, showing that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
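The abstract names a Scalable Single-Stream Diffusion Transformer (S3-DiT) but does not describe its internals. The sketch below only illustrates the general single-stream idea: text and image tokens are concatenated into one sequence and processed by shared attention and MLP layers, in contrast to dual-stream blocks that keep separate per-modality weights. All module names, dimensions, and the absence of timestep conditioning here are assumptions for illustration, not the actual S3-DiT implementation.

```python
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    """Illustrative single-stream DiT block: text and image tokens share one
    attention + MLP path instead of separate per-modality branches.
    Hypothetical layer sizes; NOT the actual S3-DiT implementation."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Concatenate both modalities into a single token stream.
        x = torch.cat([txt_tokens, img_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # joint self-attention over text + image tokens
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back so the caller can route each modality separately if needed.
        n_txt = txt_tokens.shape[1]
        return x[:, n_txt:], x[:, :n_txt]

# Usage sketch: batch of 2, 77 text tokens, 256 image (latent-patch) tokens.
block = SingleStreamBlock()
img = torch.randn(2, 256, 1024)
txt = torch.randn(2, 77, 1024)
img_out, txt_out = block(img, txt)
```

As a quick sanity check on the quoted budget, $630K divided by 314K H800 GPU hours works out to roughly $2 per GPU hour, consistent with the figures stated in the abstract.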