Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
May 14, 2024
Authors: Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu
cs.AI
Abstract
We present Hunyuan-DiT, a text-to-image diffusion transformer with
fine-grained understanding of both English and Chinese. To construct
Hunyuan-DiT, we carefully design the transformer structure, text encoder, and
positional encoding. We also build from scratch a whole data pipeline to update
and evaluate data for iterative model optimization. For fine-grained language
understanding, we train a Multimodal Large Language Model to refine the
captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal
dialogue with users, generating and refining images according to the context.
Through our holistic human evaluation protocol with more than 50 professional
human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image
generation compared with other open-source models. Code and pretrained models
are publicly available at github.com/Tencent/HunyuanDiT.