HunyuanImage 3.0 Technical Report

September 28, 2025
Authors: Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong
cs.AI

Abstract

We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thought schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments, and the results of automatic and human evaluations of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.
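The gap between the 80B+ total parameters and the 13B activated per token follows from sparse MoE routing: a learned gate selects a small subset of experts for each token, so only those experts' weights participate in that token's forward pass. The sketch below illustrates the general top-k routing idea in PyTorch; the layer sizes, expert count, and top-k value are illustrative placeholders, not HunyuanImage 3.0's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic sparse Mixture-of-Experts layer (illustrative, not the
    HunyuanImage 3.0 architecture): each token is routed to only top_k
    of num_experts feed-forward experts, so the parameters touched per
    token are a small fraction of the layer's total."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        logits = self.gate(x)                   # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over chosen experts
        out = torch.zeros_like(x)
        for rank in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., rank] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., rank][mask].unsqueeze(-1) * expert(x[mask])
        return out

# With 8 experts and 2 active per token, roughly a quarter of the expert
# weights are used for any given token, while the full capacity remains
# available across tokens.
moe = TopKMoE()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```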