Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing
September 2, 2025
Authors: Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, Dimitris Metaxas
cs.AI
Abstract
Visual autoregressive models (VAR) have recently emerged as a promising class
of generative models, achieving performance comparable to diffusion models in
text-to-image generation tasks. While conditional generation has been widely
explored, the ability to perform prompt-guided image editing without additional
training is equally critical, as it supports numerous practical real-world
applications. This paper investigates the text-to-image editing capabilities of
VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise
inversion-based editing technique designed explicitly for VAR models. VARIN
leverages a novel pseudo-inverse function for argmax sampling, named
Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These
inverse noises enable precise reconstruction of the source image and facilitate
targeted, controllable edits aligned with textual prompts. Extensive
experiments demonstrate that VARIN effectively modifies source images according
to specified prompts while largely preserving the original background and
structural details, validating its efficacy as a practical editing approach.
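The core idea behind noise inversion for argmax sampling can be illustrated with the Gumbel-max trick: adding i.i.d. Gumbel noise to the logits and taking the argmax draws a token from the softmax distribution, so "inverting" a generation amounts to finding Gumbel noise under which the argmax reproduces the source token. The sketch below shows a naive pseudo-inverse of this kind; it is illustrative only, and the paper's Location-aware Argmax Inversion (LAI) is a different, more principled construction whose details are not given in this abstract.

```python
import math
import random

def sample_gumbel(rng, n):
    # g_i = -log(-log(U)), U ~ Uniform(0, 1): standard Gumbel noise
    return [-math.log(-math.log(rng.random())) for _ in range(n)]

def gumbel_max_sample(logits, rng):
    """Gumbel-max trick: argmax over (logit + Gumbel noise) is an
    exact sample from softmax(logits)."""
    g = sample_gumbel(rng, len(logits))
    scores = [l + n for l, n in zip(logits, g)]
    return scores.index(max(scores)), g

def naive_argmax_inverse(logits, token, rng, margin=1e-3):
    """Naive pseudo-inverse of argmax sampling (hypothetical helper,
    not the paper's LAI): draw fresh Gumbel noise, then shift the
    coordinate of the observed `token` just enough that
    argmax(logits + g) reproduces it exactly."""
    g = sample_gumbel(rng, len(logits))
    scores = [l + n for l, n in zip(logits, g)]
    best_other = max(s for i, s in enumerate(scores) if i != token)
    if scores[token] <= best_other:
        g[token] += best_other - scores[token] + margin
    return g

rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(8)]
token, _ = gumbel_max_sample(logits, rng)        # "source" token
g_inv = naive_argmax_inverse(logits, token, rng)  # inverse noise
scores = [l + n for l, n in zip(logits, g_inv)]
assert scores.index(max(scores)) == token        # exact reconstruction
```

In an editing pipeline, such inverse noise would be held fixed while the text-conditioned logits change, so tokens whose logits are unaffected by the new prompt are reproduced verbatim (preserving background) while prompt-relevant tokens can flip.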