i1: A Simple and Fully Open Recipe for Building Strong Text-to-Image Models

Abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is hard to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations and do not disclose their training data or full training details. The research community needs fully open models (weights, data, and code) as a foundation for further research, yet existing fully open models still fall significantly short of leading models in performance. In this project, we systematically investigate the modeling and data design choices in text-to-image diffusion training and inference, using 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) that help train strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model, using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We release the i1 checkpoints, the training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research on text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.
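
To make the "equal weighting" finding concrete, the following is a minimal sketch of what equal-weight mixing of curated datasets could look like at the sampling level, contrasted with size-proportional mixing. It is not taken from the i1 codebase: the dataset names, sizes, and the helper functions `mixture_weights` and `sample_sources` are hypothetical and exist only for illustration.

```python
import random
from collections import Counter


def mixture_weights(dataset_sizes, strategy="equal"):
    """Return per-dataset sampling probabilities (hypothetical helper).

    strategy="equal": every dataset gets the same probability, regardless of size.
    strategy="proportional": probability is proportional to dataset size.
    """
    names = list(dataset_sizes)
    if strategy == "equal":
        return {name: 1.0 / len(names) for name in names}
    if strategy == "proportional":
        total = sum(dataset_sizes.values())
        return {name: dataset_sizes[name] / total for name in names}
    raise ValueError(f"unknown strategy: {strategy}")


def sample_sources(weights, num_samples, seed=0):
    """Draw the source dataset for each training example and count the draws."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return Counter(rng.choices(names, weights=probs, k=num_samples))


# Hypothetical curated datasets with very different sizes (assumed values).
sizes = {"dataset_a": 1_000_000, "dataset_b": 200_000, "dataset_c": 50_000}

equal = mixture_weights(sizes, "equal")
proportional = mixture_weights(sizes, "proportional")

# With equal weighting, each source contributes about one third of the
# examples, so the smallest dataset is revisited far more often than the largest.
counts = sample_sources(equal, num_samples=30_000)
```

Under these assumptions, equal weighting gives every source the same share of each batch, while proportional weighting would let `dataset_a` supply 80% of the examples. The abstract reports equal weighting only as a strong default, so any such mixture would still need to be validated for a given set of datasets.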