Qwen-Image-Flash: Beyond Objective Design

Abstract

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in the distillation of unified text-to-image generation and instruction-guided image editing: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.