
Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

September 2, 2025
作者: Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, Dimitris Metaxas
cs.AI

Abstract

Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
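The inversion idea in the abstract builds on the Gumbel-max trick: a discrete token is sampled as the argmax of logits perturbed by Gumbel noise, and "inverting" a source image means recovering noise under which the argmax reproduces the observed tokens. The sketch below illustrates that principle with the standard conditional (truncated) Gumbel construction; it is a generic illustration for intuition only, not the paper's Location-aware Argmax Inversion (LAI), and all function names are hypothetical.

```python
import numpy as np

def gumbel_argmax(logits, rng):
    # Gumbel-max trick: argmax(logits + g) is a sample from softmax(logits).
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + g)), g

def inverse_gumbel(logits, k, rng):
    """Sample noise g such that argmax(logits + g) == k.

    Uses the standard conditional-Gumbel (truncated Gumbel) construction:
    draw the perturbed maximum at position k, then draw all other perturbed
    values truncated below it. This is a generic pseudo-inverse of argmax
    sampling, not the paper's LAI.
    """
    Z = np.logaddexp.reduce(logits)              # log-partition of the logits
    top = Z - np.log(-np.log(rng.uniform()))     # perturbed maximum value
    u = rng.uniform(size=logits.shape)
    # Truncated Gumbels: each non-argmax value is forced below `top`.
    vals = logits - np.log(np.exp(logits - top) - np.log(u))
    vals[k] = top
    return vals - logits                         # additive noise that selects k
```

Replaying the recovered noise reconstructs the chosen token exactly (`argmax(logits + g) == k`), which is the property that lets an editing method re-generate a source image faithfully and then steer selected tokens with a new text prompt.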
PDF | September 3, 2025