In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
April 29, 2025
Authors: Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang
cs.AI
Abstract
Instruction-based image editing enables robust image modification via natural
language prompts, yet current methods face a precision-efficiency tradeoff.
Fine-tuning methods demand significant computational resources and large
datasets, while training-free techniques struggle with instruction
comprehension and edit quality. We resolve this dilemma by leveraging
the large-scale Diffusion Transformer's (DiT) enhanced generation capacity and
native contextual awareness. Our solution introduces three contributions: (1)
an in-context editing framework for zero-shot instruction compliance using
in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning
strategy that enhances flexibility with efficient adaptation and dynamic expert
routing, without extensive retraining; and (3) an early filter inference-time
scaling method using vision-language models (VLMs) to select better initial
noise early, improving edit quality. Extensive evaluations demonstrate our
method's superiority: it outperforms state-of-the-art approaches while
requiring only 0.5% of the training data and 1% of the trainable parameters compared to
conventional baselines. This work establishes a new paradigm that enables
high-precision yet efficient instruction-guided editing. Code and demos can be
found at https://river-zhang.github.io/ICEdit-gh-pages/.
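
The third contribution, early-filter inference-time scaling, amounts to sampling several candidate initial noises, cheaply previewing each, and letting a VLM judge pick the one worth a full denoising run. Below is a minimal Python sketch of that idea only; `quick_edit`, `full_edit`, and `vlm_score` are hypothetical placeholders standing in for the paper's DiT editor and VLM scorer, not the authors' actual API.

```python
# Sketch of VLM-based early filtering of initial noise at inference time.
# All helpers below are hypothetical stand-ins, not the released ICEdit code.
import torch

def quick_edit(image, instruction, noise, steps=4):
    """Hypothetical cheap preview: run only a few denoising steps from `noise`."""
    return image  # placeholder result

def full_edit(image, instruction, noise, steps=50):
    """Hypothetical full-quality edit that starts from the chosen `noise`."""
    return image  # placeholder result

def vlm_score(image, instruction):
    """Hypothetical VLM judge: higher means the edit follows the instruction better."""
    return torch.rand(()).item()  # placeholder score

def edit_with_early_filter(image, instruction, num_candidates=4,
                           latent_shape=(1, 16, 64, 64)):
    # Sample several candidate initial noises (latent shape is illustrative).
    noises = [torch.randn(latent_shape) for _ in range(num_candidates)]
    # Preview each candidate cheaply and score the partial edits with the VLM.
    scores = [vlm_score(quick_edit(image, instruction, n), instruction)
              for n in noises]
    # Spend the full denoising budget only on the most promising noise.
    best = noises[max(range(num_candidates), key=scores.__getitem__)]
    return full_edit(image, instruction, best)
```

The design intent mirrored here is that the VLM filters candidates early, so most of the compute is spent on a single, well-chosen starting noise rather than on many full-length edits.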