Step1X-Edit: A Practical Framework for General Image Editing

Abstract

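Image editing models have advanced rapidly in recent years. Recently released frontier multimodal models such as GPT-4o and Gemini2 Flash show highly promising image editing capabilities. They handle a wide range of user editing needs well, marking a major step forward for image processing. However, a large gap remains between open-source algorithms and these closed-source models. This paper therefore releases Step1X-Edit, a state-of-the-art image editing model whose performance is comparable to closed-source models such as GPT-4o and Gemini2 Flash. Specifically, we use a multimodal large language model to process the reference image and the user's editing instruction, extract a latent embedding, and combine it with a diffusion image decoder to generate the target image. To train the model, we build a data generation pipeline that produces a high-quality dataset. For evaluation, we develop GEdit-Bench, a new benchmark based on real user instructions. Experimental results on GEdit-Bench show that Step1X-Edit substantially outperforms existing open-source baselines and approaches the performance of leading proprietary models, making an important contribution to the field of image editing.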

English

In recent years, image editing models have developed remarkably quickly. Recently unveiled multimodal models such as GPT-4o and Gemini2 Flash have introduced highly promising image editing capabilities. These models fulfill the vast majority of user-driven editing requirements, marking a significant advance in image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. In this paper, we therefore release a state-of-the-art image editing model, called Step1X-Edit, whose performance is comparable to that of closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline that produces a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making a significant contribution to the field of image editing.
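The data flow described in the abstract (reference image and instruction into a multimodal LLM, a latent embedding out, a diffusion decoder conditioned on that embedding) can be pictured with a toy sketch. Everything below is a hypothetical stand-in written for illustration: `EditRequest`, `toy_mllm_embed`, and `toy_diffusion_decode` are invented names, the "embedding" is a seeded random vector, and the "decoder" only mimics the shape of an iterative denoising loop. It is not the Step1X-Edit implementation and says nothing about the real model's architecture or API.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class EditRequest:
    image: List[float]   # flattened reference image, pixel values in [0, 1]
    instruction: str     # the user's editing instruction


def toy_mllm_embed(request: EditRequest, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for the multimodal LLM.

    Produces a deterministic latent vector that depends on both the
    instruction text and the reference image (via its mean pixel value).
    """
    rng = random.Random(request.instruction)  # seed from the instruction text
    text_part = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    mean_pixel = sum(request.image) / len(request.image)
    return [t + mean_pixel for t in text_part]


def toy_diffusion_decode(reference: List[float], latent: List[float],
                         steps: int = 10, seed: int = 0) -> List[float]:
    """Hypothetical stand-in for the diffusion image decoder.

    Starts from Gaussian noise and moves step by step toward a target
    that is determined by the reference image and the latent embedding.
    A real decoder would predict noise with a learned network instead.
    """
    rng = random.Random(seed)
    shift = 0.1 * sum(latent) / len(latent)            # conditioning signal
    target = [min(1.0, max(0.0, p + shift)) for p in reference]
    x = [rng.gauss(0.0, 1.0) for _ in reference]       # initial noise
    for step in range(steps):
        alpha = 1.0 / (steps - step)                   # reaches 1.0 on the last step
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]
    return x


def edit_image(request: EditRequest) -> List[float]:
    """Chain the two stages: embed first, then decode conditioned on the embedding."""
    latent = toy_mllm_embed(request)
    return toy_diffusion_decode(request.image, latent)


if __name__ == "__main__":
    req = EditRequest(image=[0.2, 0.4, 0.6, 0.8], instruction="make it brighter")
    print(edit_image(req))
```

The point of the sketch is only the separation of concerns: the instruction and the reference image influence the output solely through the latent embedding and the conditioning inside the decoder, which matches the two-stage description in the abstract.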