Visual Autoregressive Modeling for Instruction-Guided Image Editing

Abstract

Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30% higher. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
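The idea of scale-matched conditioning can be illustrated with a small NumPy sketch. This is not the VAREdit implementation: the block-average downsampling, the single-head attention, the feature sizes, and the names `average_pool_to`, `scale_matched_references`, and `attend` are all assumptions made for illustration. It only shows one plausible reading of "scale-matched", in which target tokens at a given scale attend to source features reduced to that same scale instead of to the finest-scale source features.

```python
import numpy as np


def average_pool_to(features: np.ndarray, size: int) -> np.ndarray:
    """Downsample a square (H, H, C) feature map to (size, size, C) by block averaging.

    Hypothetical helper: assumes H is divisible by size.
    """
    h, w, c = features.shape
    assert h == w and h % size == 0, "expects a square map divisible by the target size"
    block = h // size
    # Split each spatial axis into (size, block) and average over the block axes.
    return features.reshape(size, block, size, block, c).mean(axis=(1, 3))


def scale_matched_references(source: np.ndarray, scales: list[int]) -> dict[int, np.ndarray]:
    """Build one source reference per target scale (an assumed form of scale alignment)."""
    return {s: average_pool_to(source, s) for s in scales}


def attend(queries: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Plain single-head scaled dot-product attention over token matrices."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
channels = 8
source = rng.normal(size=(16, 16, channels))  # stand-in for finest-scale source features
scales = [1, 2, 4, 8, 16]                     # illustrative coarse-to-fine schedule
refs = scale_matched_references(source, scales)

# Target tokens at scale 4 attend to the source reference at the same scale.
target_tokens = rng.normal(size=(4 * 4, channels))
ref_tokens = refs[4].reshape(-1, channels)
out = attend(target_tokens, ref_tokens, ref_tokens)
```

In this sketch, the 16 target tokens at scale 4 see 16 source tokens at the same resolution, not the 256 finest-scale tokens. That resolution mismatch is the one the abstract says hampers the prediction of coarser target features.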
