Visual Autoregressive Modeling for Instruction-Guided Image Editing

August 21, 2025
Authors: Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
cs.AI

Abstract

Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition on the source image tokens. We observe that the finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advances in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30% higher. Moreover, it completes a 512×512 edit in 1.2 seconds, 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
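The abstract names the Scale-Aligned Reference (SAR) module but does not spell out its implementation. Below is a minimal, hypothetical PyTorch sketch of the idea as described: downsample the finest-scale source features to the current target scale and inject them as extra key/value context in the first self-attention layer. The class name `ScaleAlignedReference`, the adaptive average pooling, and the concatenated key/value scheme are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAlignedReference(nn.Module):
    """Illustrative sketch (not the released implementation): align
    finest-scale source-image features to the current target scale and
    feed them as additional key/value context to a self-attention layer."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, target_tokens, source_feat, scale_hw):
        # target_tokens: (B, h*w, C) token features at the current target scale
        # source_feat:   (B, C, H, W) finest-scale source-image feature map
        # scale_hw:      (h, w) spatial size of the current target scale
        # Downsample the source features so they are scale-matched.
        ref = F.adaptive_avg_pool2d(source_feat, scale_hw)   # (B, C, h, w)
        ref = ref.flatten(2).transpose(1, 2)                 # (B, h*w, C)
        # Prepend the scale-aligned reference as extra attention context.
        ctx = torch.cat([ref, target_tokens], dim=1)         # (B, 2*h*w, C)
        out, _ = self.attn(target_tokens, ctx, ctx)
        return target_tokens + out                           # residual update

# Toy usage: condition 16x16 target tokens on a 64x64 source feature map.
if __name__ == "__main__":
    sar = ScaleAlignedReference(dim=256)
    tgt = torch.randn(1, 16 * 16, 256)
    src = torch.randn(1, 256, 64, 64)
    print(sar(tgt, src, (16, 16)).shape)  # torch.Size([1, 256, 256])
```

On this reading, aligning the reference to each scale spares coarse target tokens from attending over a much finer source grid, which is exactly the scale mismatch the abstract identifies.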