Seedream 4.0: Toward Next-generation Multimodal Image Generation

Abstract

We introduce Seedream 4.0, an efficient, high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer paired with a powerful VAE (Variational Autoencoder) that considerably reduces the number of image tokens. This allows efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable, large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM (Vision-Language Model), we perform multimodal post-training that trains T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, quantization, and speculative decoding. The system achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations show that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also supports multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.
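The link between VAE compression and image token count can be illustrated with a small calculation. The sketch below is not taken from the Seedream 4.0 system: the function name `latent_token_count`, the spatial downsampling factors (8 and 16), and the patch size (2) are all hypothetical values chosen only to show how a stronger VAE shrinks the sequence a diffusion transformer must process.

```python
def latent_token_count(height: int, width: int,
                       vae_downsample: int, patch_size: int = 1) -> int:
    """Count the tokens a diffusion transformer sees for one image.

    Assumes (hypothetically) that the VAE shrinks each spatial side by
    `vae_downsample` and that the transformer groups the latent grid
    into non-overlapping `patch_size` x `patch_size` patches.
    """
    stride = vae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("image sides must be divisible by vae_downsample * patch_size")
    return (height // stride) * (width // stride)


# Illustrative comparison for a 2048 x 2048 ("2K") image.
# The factors 8 and 16 are assumptions, not values reported for Seedream 4.0.
baseline = latent_token_count(2048, 2048, vae_downsample=8, patch_size=2)
stronger = latent_token_count(2048, 2048, vae_downsample=16, patch_size=2)
print(baseline, stronger, baseline // stronger)
```

Under these assumed settings, doubling the per-side compression cuts the token count by a factor of four, and because self-attention cost grows roughly quadratically with sequence length, the saving in attention compute is larger still.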