Qwen-Image Technical Report

Abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that covers large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages such as Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we feed the original image separately into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to balance semantic consistency against visual fidelity. Qwen-Image achieves state-of-the-art performance across multiple benchmarks, demonstrating strong capabilities in both image generation and editing.
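The progressive training strategy described above can be pictured as a stage schedule. The following is only a minimal sketch, not the authors' implementation: the stage order is one reading of the abstract, while the `Stage` class, the `stage_for` helper, and the stage boundaries (25%, 50%, 75% of training) are hypothetical values chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    start_fraction: float  # hypothetical: share of training completed when the stage begins


# Stage order follows one reading of the abstract; the boundaries are assumptions.
CURRICULUM = (
    Stage("non-text rendering", 0.0),
    Stage("simple text", 0.25),
    Stage("complex text", 0.5),
    Stage("paragraph-level descriptions", 0.75),
)


def stage_for(step: int, total_steps: int) -> Stage:
    """Return the curriculum stage that is active at a given training step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    progress = step / total_steps
    current = CURRICULUM[0]
    # Keep the last stage whose start has already been reached.
    for stage in CURRICULUM:
        if progress >= stage.start_fraction:
            current = stage
    return current


if __name__ == "__main__":
    for step in (0, 30, 60, 90):
        print(step, stage_for(step, 100).name)
```

A data loader could call `stage_for` at each step to decide which kind of sample to draw, so that harder text-rendering examples only appear once the earlier stages have been covered.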
