Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
May 14, 2024
Authors: Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu
cs.AI
Abstract
We present Hunyuan-DiT, a text-to-image diffusion transformer with
fine-grained understanding of both English and Chinese. To construct
Hunyuan-DiT, we carefully design the transformer structure, text encoder, and
positional encoding. We also build from scratch a whole data pipeline to update
and evaluate data for iterative model optimization. For fine-grained language
understanding, we train a Multimodal Large Language Model to refine the
captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal
dialogue with users, generating and refining images according to the context.
Through our holistic human evaluation protocol with more than 50 professional
human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image
generation compared with other open-source models. Code and pretrained models
are publicly available at github.com/Tencent/HunyuanDiT.
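Since the code and weights are public, one quick way to try the model is through the Hugging Face diffusers integration. The snippet below is a minimal sketch, not the paper's own inference pipeline; the `HunyuanDiTPipeline` class and the `Tencent-Hunyuan/HunyuanDiT-Diffusers` checkpoint name are assumptions based on the community integration rather than details stated in this abstract.

```python
# Minimal text-to-image sketch using the diffusers integration (assumed API,
# not the official scripts in github.com/Tencent/HunyuanDiT).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",  # assumed checkpoint name
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Prompts may be written in Chinese or English; the model is trained for both.
prompt = "一只穿着宇航服的柴犬在月球上行走"  # "A Shiba Inu in a spacesuit walking on the moon"
image = pipe(prompt).images[0]
image.save("hunyuan_dit_sample.png")
```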