In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Abstract

Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a trade-off between precision and efficiency. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiTs). Our solution introduces three contributions: (1) an in-context editing framework that achieves zero-shot instruction compliance through in-context prompting, without structural changes to the model; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility through efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early-filter inference-time scaling method that uses vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% of the training data and 1% of the trainable parameters used by conventional baselines. This work establishes a new paradigm for high-precision yet efficient instruction-guided editing. Code and demos are available at https://river-zhang.github.io/ICEdit-gh-pages/.
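The third contribution, early-filter inference-time scaling, can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function name `early_filter_select`, its parameters, and the idea of passing the sampler and the VLM scorer in as plain callables are all hypothetical, and the toy callables in the usage lines stand in for a real diffusion sampler and a real VLM. The sketch only shows the general pattern of running a few cheap denoising steps per candidate noise, scoring the previews, and spending the remaining steps on the best candidate.

```python
from typing import Callable, Sequence, TypeVar

Latent = TypeVar("Latent")


def early_filter_select(
    seeds: Sequence[int],
    make_noise: Callable[[int], Latent],
    denoise: Callable[[Latent, int], Latent],
    score: Callable[[Latent], float],
    preview_steps: int,
    total_steps: int,
) -> tuple[int, Latent]:
    """Pick the most promising seed from cheap previews, then finish only that one.

    All arguments are hypothetical stand-ins: `denoise(latent, n)` advances a
    latent by n sampler steps, and `score(latent)` plays the role of a VLM
    judging how well a rough preview follows the edit instruction.
    """
    if not seeds:
        raise ValueError("need at least one candidate seed")
    if not 0 < preview_steps <= total_steps:
        raise ValueError("preview_steps must be in 1..total_steps")

    # Run only a few denoising steps per candidate to get a rough preview.
    previews = {seed: denoise(make_noise(seed), preview_steps) for seed in seeds}

    # Keep the candidate whose preview receives the highest score.
    best_seed = max(seeds, key=lambda s: score(previews[s]))

    # Spend the remaining step budget on the winning candidate only.
    final = denoise(previews[best_seed], total_steps - preview_steps)
    return best_seed, final


# Toy usage: latents are floats, each "step" halves the value, and the
# scorer prefers previews close to 1.0. None of this models a real DiT or VLM.
result = early_filter_select(
    seeds=[1, 4, 8],
    make_noise=float,
    denoise=lambda x, n: x * 0.5 ** n,
    score=lambda x: -abs(x - 1.0),
    preview_steps=2,
    total_steps=4,
)
print(result)  # (4, 0.25)
```

With N candidates, this costs N × `preview_steps` + (`total_steps` − `preview_steps`) sampler steps rather than N × `total_steps`, which is the efficiency argument for filtering early instead of scoring fully generated images.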
