i1: A Simple and Fully Open Recipe for Building Strong Text-to-Image Models

Abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is difficult to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations and do not disclose their training data or full training details. The research community needs fully open models (weights, data, and code) as a foundation for further research, yet existing fully open models still lag well behind leading models in performance. In this project, we systematically investigate the modeling and data design choices in text-to-image diffusion training and inference through 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments yield several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) that guide the training of strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model, using only publicly available datasets. On five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), i1 is competitive with leading models and outperforms the best existing fully open model by 29.5 absolute percentage points on average across the five. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.
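
As an illustration of the "equal weighting" finding mentioned in the abstract, the following is a minimal sketch of one possible reading: each curated dataset is sampled with the same probability regardless of its size. This is an assumption about what equal weighting means here, not the i1 training code; the function name `mix_equal_weight` and the dataset names and sizes are hypothetical.

```python
import random
from collections import Counter


def mix_equal_weight(datasets, num_samples, seed=0):
    """Draw samples by picking a dataset uniformly, then an item uniformly within it.

    Assumption: "equal weighting" means every dataset has the same sampling
    probability, independent of how many items it contains.
    """
    rng = random.Random(seed)
    names = sorted(datasets)  # fixed order so the result depends only on the seed
    samples = []
    for _ in range(num_samples):
        name = rng.choice(names)            # each dataset is equally likely
        item = rng.choice(datasets[name])   # uniform choice within that dataset
        samples.append((name, item))
    return samples


# Hypothetical curated datasets of very different sizes.
datasets = {
    "curated_a": [f"a_{i}" for i in range(10_000)],
    "curated_b": [f"b_{i}" for i in range(500)],
    "curated_c": [f"c_{i}" for i in range(50)],
}

samples = mix_equal_weight(datasets, num_samples=3000)
counts = Counter(name for name, _ in samples)
print(counts)  # each dataset contributes roughly a third of the samples
```

Under this reading, small datasets are revisited far more often than large ones, which is the main trade-off relative to size-proportional sampling.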