Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Abstract

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build a complete data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the image captions. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Under our holistic human evaluation protocol, with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.
