In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
April 29, 2025
Authors: Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang
cs.AI
Abstract
Instruction-based image editing enables robust image modification via natural
language prompts, yet current methods face a precision-efficiency tradeoff.
Fine-tuning methods demand significant computational resources and large
datasets, while training-free techniques struggle with instruction
comprehension and edit quality. We resolve this dilemma by leveraging the
enhanced generation capacity and native contextual awareness of large-scale
Diffusion Transformers (DiTs). Our solution introduces three contributions: (1)
an in-context editing framework for zero-shot instruction compliance using
in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning
strategy that enhances flexibility with efficient adaptation and dynamic expert
routing, without extensive retraining; and (3) an early filter inference-time
scaling method using vision-language models (VLMs) to select better initial
noise early, improving edit quality. Extensive evaluations demonstrate our
method's superiority: it outperforms state-of-the-art approaches while
requiring only 0.5% of the training data and 1% of the trainable parameters of
conventional baselines. This work establishes a new paradigm that enables
high-precision yet efficient instruction-guided editing. Code and demos are
available at https://river-zhang.github.io/ICEdit-gh-pages/.
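The first contribution, the in-context editing framework, works purely through prompting: instead of modifying the DiT architecture, the edit instruction is folded into a generation prompt that asks the model to render the source content and its edited counterpart together, so the pretrained model's contextual awareness carries the instruction. The sketch below shows what such an in-context edit prompt could look like; the wording and the helper name are illustrative assumptions, not the prompt used in the paper.

def build_in_context_prompt(instruction: str) -> str:
    # Hypothetical in-context edit prompt: the model is asked to generate a
    # side-by-side pair in which the right panel applies the instruction to
    # the left panel, with no architectural changes to the DiT.
    return (
        "A side-by-side image pair of the same scene. "
        "The left panel is the original photo; the right panel is identical "
        f"except that this edit is applied: {instruction}."
    )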
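The second contribution, the LoRA-MoE hybrid tuning strategy, combines parameter-efficient low-rank adapters with dynamic expert routing. A minimal PyTorch sketch follows, assuming one frozen linear projection inside the DiT is augmented with several LoRA experts and a per-token router; the class name, rank, expert count, and top-k routing are illustrative choices, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELinear(nn.Module):
    """Frozen linear layer plus a mixture of LoRA experts (hypothetical sketch)."""

    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 16,
                 top_k: int = 1, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # the DiT weights stay frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        # Each expert i holds a rank-`rank` update A_i @ B_i of the base weight.
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.router = nn.Linear(d_in, num_experts)   # lightweight gating network
        self.top_k, self.scale = top_k, alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., d_in)
        out = self.base(x)
        gates = F.softmax(self.router(x), dim=-1)         # (..., num_experts)
        topv, topi = gates.topk(self.top_k, dim=-1)       # route each token
        topv = topv / topv.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            A = self.lora_A[topi[..., k]]                  # (..., d_in, rank)
            B = self.lora_B[topi[..., k]]                  # (..., rank, d_out)
            delta = torch.einsum('...i,...ir,...ro->...o', x, A, B)
            out = out + self.scale * topv[..., k:k + 1] * delta
        return out

Only the LoRA matrices and the router are trainable in this sketch, which is how a setup of this kind can stay near the reported 1% trainable-parameter budget while still letting different experts specialize to different edit types.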
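The third contribution, early-filter inference-time scaling, can be sketched as a simple loop: draw several candidate initial noise tensors, render a cheap few-step preview from each, let a vision-language model judge how well each preview follows the instruction, and spend the full denoising budget only on the winning noise. In the sketch below, pipe, pipe.latent_shape, and vlm_score are assumed interfaces for illustration, not APIs from the released code.

import torch

def edit_with_early_filter(pipe, vlm_score, image, instruction,
                           num_candidates: int = 4, preview_steps: int = 4,
                           full_steps: int = 28, seed: int = 0):
    generator = torch.Generator().manual_seed(seed)
    # 1) Draw several candidate initial noise tensors.
    noises = [torch.randn(pipe.latent_shape, generator=generator)
              for _ in range(num_candidates)]
    # 2) Cheap previews: only a few denoising steps per candidate.
    previews = [pipe(image, instruction, latents=z,
                     num_inference_steps=preview_steps) for z in noises]
    # 3) Let the VLM rank the previews against the instruction.
    scores = [vlm_score(p, instruction) for p in previews]
    best = max(range(num_candidates), key=lambda i: scores[i])
    # 4) Spend the full step budget only on the selected initial noise.
    return pipe(image, instruction, latents=noises[best],
                num_inference_steps=full_steps)

Because the filter runs before most of the denoising work, the extra cost is roughly num_candidates × preview_steps denoising steps plus one VLM call per candidate, which is small relative to a full generation.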