Qwen-Image-Flash: Beyond Objective Design

Abstract

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on the distillation objective. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe, which critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in distillation for unified text-to-image generation and instruction-guided image editing: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivated the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives but also a principled organization of the broader training pipeline.