Step1X-Edit: A Practical Framework for General Image Editing
April 24, 2025
Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
cs.AI
Abstract
In recent years, image editing models have witnessed remarkable and rapid
development. The recent unveiling of cutting-edge multimodal models such as
GPT-4o and Gemini2 Flash has introduced highly promising image editing
capabilities. These models demonstrate an impressive aptitude for fulfilling the
vast majority of user-driven editing requirements, marking a significant
advancement in the field of image manipulation. However, there is still a large
gap between open-source algorithms and these closed-source models. Thus, in
this paper, we aim to release a state-of-the-art image editing model, called
Step1X-Edit, which can provide performance comparable to closed-source models
such as GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM
to process the reference image and the user's editing instruction. A latent
embedding is extracted and integrated with a diffusion image decoder to obtain
the target image. To train the model, we build a data generation pipeline that
produces a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel
benchmark rooted in real-world user instructions. Experimental results on
GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source
baselines by a substantial margin and approaches the performance of leading
proprietary models, thereby making significant contributions to the field of
image editing.
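The abstract describes the pipeline only at a high level: a multimodal LLM reads the reference image and the editing instruction, a latent embedding is extracted, and a diffusion decoder uses that embedding to produce the target image. The pure-Python toy below traces that data flow under loudly stated assumptions. `toy_mllm`, `connector`, and `oracle_eps` are hypothetical stand-ins, not the paper's modules. The embedding is mean-pooled and linearly projected, which is an assumed design. The decoder is a deterministic DDIM-style sampler whose noise predictor is an oracle rather than a trained network. The sketch shows how the conditioning flows, not how Step1X-Edit is implemented.

```python
import math
import random
from typing import List

Vector = List[float]


def word_embedding(word: str, dim: int) -> Vector:
    # Deterministic toy embedding: seeding Random with a str is stable across runs.
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_mllm(image_patches: List[Vector], instruction: str) -> List[Vector]:
    """Hypothetical stand-in for the multimodal LLM: joins image and text tokens
    into one sequence and mixes in global context so every token 'sees' both."""
    dim = len(image_patches[0])
    tokens = image_patches + [word_embedding(w, dim) for w in instruction.lower().split()]
    context = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [[math.tanh(t + c) for t, c in zip(tok, context)] for tok in tokens]


def connector(hidden: List[Vector], weight: List[Vector]) -> Vector:
    """Hypothetical projection (assumed design): mean-pool the hidden states, then
    map them into the decoder's conditioning space (weight rows = output dims)."""
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    return [sum(w * p for w, p in zip(row, pooled)) for row in weight]


def oracle_eps(x_t: Vector, ref_latent: Vector, cond: Vector, alpha_bar: float) -> Vector:
    """Stand-in for a trained denoiser: returns the noise that would map x_t back
    to ref_latent + cond, i.e. the reference latent shifted along the condition."""
    target = [r + c for r, c in zip(ref_latent, cond)]
    return [(x - math.sqrt(alpha_bar) * t) / math.sqrt(1.0 - alpha_bar)
            for x, t in zip(x_t, target)]


def ddim_sample(ref_latent: Vector, cond: Vector, steps: int = 10, seed: int = 0) -> Vector:
    """Deterministic DDIM-style sampling (eta = 0) starting from pure noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in ref_latent]
    # alpha_bar rises from 0.01 (mostly noise) to exactly 1.0 (clean) at the end.
    schedule = [0.01 + 0.99 * i / steps for i in range(steps + 1)]
    for i in range(steps):
        ab, ab_next = schedule[i], schedule[i + 1]
        eps = oracle_eps(x, ref_latent, cond, ab)
        # Estimate the clean latent, then re-noise it to the next (cleaner) level.
        x0_hat = [(xi - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_next) * x0 + math.sqrt(1.0 - ab_next) * e
             for x0, e in zip(x0_hat, eps)]
    return x


# Usage with made-up toy values (not taken from the paper).
dim, latent_dim = 4, 3
patches = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]  # toy "reference image" tokens
hidden = toy_mllm(patches, "make the sky orange")
proj = [[0.1 * (r + c) for c in range(dim)] for r in range(latent_dim)]  # hypothetical weights
cond = connector(hidden, proj)
ref_latent = [0.0, 0.5, -0.5]
edited = ddim_sample(ref_latent, cond)
print([round(v, 3) for v in edited])
```

Because the oracle predicts the exact noise, the sampler lands on `ref_latent + cond`. In a real system a trained denoiser would have to learn that mapping, and a different instruction would steer the result only through the conditioning embedding.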