i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.