
Visual Autoregressive Modeling for Instruction-Guided Image Editing

August 21, 2025
作者: Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
cs.AI

Abstract

Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition generation on the source image tokens. We observe that the finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30% higher. Moreover, it completes a 512×512 edit in 1.2 seconds, 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
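
To make the two mechanisms in the abstract concrete, below is a minimal, self-contained PyTorch sketch of (1) coarse-to-fine next-scale prediction over token maps and (2) a SAR-style step that injects scale-matched source features into the first self-attention layer. This is an illustration under stated assumptions, not the authors' implementation: the scale schedule, module shapes, and the use of area pooling to align the finest-scale source features to each target scale are all assumptions, and the deeper transformer layers and token decoding are elided.

```python
# Minimal sketch (NOT the authors' code) of next-scale prediction with a
# Scale-Aligned Reference (SAR) style conditioning step. All names, shapes,
# and the scale schedule below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, SCALES = 256, [1, 2, 4, 8]  # assumed token-map side lengths per scale


class SARBlock(nn.Module):
    """First transformer block: target tokens attend over
    [scale-matched source tokens | target tokens]."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tgt: torch.Tensor, src_aligned: torch.Tensor) -> torch.Tensor:
        # Prepend the scale-matched source tokens so the tokens being
        # predicted see conditioning information at the same resolution.
        ctx = torch.cat([src_aligned, tgt], dim=1)
        out, _ = self.attn(self.norm(tgt), self.norm(ctx), self.norm(ctx))
        return tgt + out


def align_source(src_finest: torch.Tensor, side: int) -> torch.Tensor:
    """Pool finest-scale source features (B, H*W, D) down to side x side.
    Area pooling is an assumed stand-in for the paper's scale alignment."""
    b, n, d = src_finest.shape
    h = int(n ** 0.5)
    grid = src_finest.transpose(1, 2).reshape(b, d, h, h)
    grid = F.interpolate(grid, size=(side, side), mode="area")
    return grid.flatten(2).transpose(1, 2)


def next_scale_generate(src_finest, first_block, start_token):
    """Predict target token maps coarse-to-fine, one whole scale per step."""
    b = src_finest.size(0)
    prefix = start_token.expand(b, 1, DIM)  # instruction/BOS placeholder
    maps = []
    for side in SCALES:
        src_k = align_source(src_finest, side)       # SAR: match the scale
        h = first_block(prefix, src_k)               # conditioned first layer
        # ... deeper plain self-attention blocks would run here ...
        nxt = h[:, -1:, :].expand(b, side * side, DIM)  # stand-in for decoding
        maps.append(nxt)
        prefix = torch.cat([prefix, nxt], dim=1)     # feed back as context
    return maps


if __name__ == "__main__":
    src = torch.randn(2, 8 * 8, DIM)                 # finest-scale source tokens
    blk, bos = SARBlock(DIM), nn.Parameter(torch.zeros(1, 1, DIM))
    outs = next_scale_generate(src, blk, bos)
    print([m.shape for m in outs])                   # one token map per scale
```

The design point the sketch mirrors is that the conditioning tokens attended to at each step live at the same resolution as the tokens being predicted, rather than always at the finest source scale, which is the gap the SAR module is described as closing.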