Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
November 27, 2025
Authors: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
cs.AI
Abstract
The landscape of high-performance image generation models is currently dominated by proprietary systems such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0, and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
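The abstract does not spell out the S3-DiT internals, but the general "single-stream" idea in diffusion transformers is to concatenate text-condition tokens and image-latent tokens into one sequence handled by a single shared attention stream, rather than maintaining separate per-modality streams as in dual-stream MMDiT designs. The NumPy sketch below illustrates only that joint-attention concept; all function names, shapes, and weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_stream_block(tokens, Wq, Wk, Wv, Wo):
    """One joint self-attention block over a single concatenated
    token stream (hypothetical sketch, single head, no norm/MLP)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))        # every token attends to every token,
    return tokens + attn @ v @ Wo               # text<->image mixing happens here

rng = np.random.default_rng(0)
d_model = 16
text_tokens  = rng.normal(size=(4, d_model))    # e.g. prompt embeddings
image_tokens = rng.normal(size=(8, d_model))    # e.g. latent image patches

# Single stream: one concatenated sequence, one shared set of weights,
# instead of separate text/image streams with cross-attention bridges.
stream = np.concatenate([text_tokens, image_tokens], axis=0)   # (12, d_model)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = single_stream_block(stream, Wq, Wk, Wv, Wo)
print(out.shape)  # (12, 16)
```

Because both modalities live in the same sequence, parameters are shared across text and image tokens, which is one way such a design can stay compact relative to dual-stream architectures of similar capability.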