Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing
September 2, 2025
Authors: Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, Dimitris Metaxas
cs.AI
Abstract
Visual autoregressive models (VAR) have recently emerged as a promising class
of generative models, achieving performance comparable to diffusion models in
text-to-image generation tasks. While conditional generation has been widely
explored, the ability to perform prompt-guided image editing without additional
training is equally critical, as it supports numerous practical real-world
applications. This paper investigates the text-to-image editing capabilities of
VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise
inversion-based editing technique designed explicitly for VAR models. VARIN
leverages a novel pseudo-inverse function for argmax sampling, named
Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These
inverse noises enable precise reconstruction of the source image and facilitate
targeted, controllable edits aligned with textual prompts. Extensive
experiments demonstrate that VARIN effectively modifies source images according
to specified prompts while largely preserving the original background and
structural details, validating its efficacy as a practical editing approach.
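The core idea behind noise inversion for argmax sampling can be illustrated with the Gumbel-max trick: adding i.i.d. Gumbel noise to the logits and taking the argmax draws a token from the softmax distribution, so "inverting" a generation amounts to finding Gumbel noise under which the argmax reproduces the source token. The sketch below shows a naive pseudo-inverse of this kind; it is illustrative only, and the paper's Location-aware Argmax Inversion (LAI) is a different, more principled construction whose details are not given in this abstract.

```python
import math
import random

def sample_gumbel(rng, n):
    # g_i = -log(-log(U)), U ~ Uniform(0, 1): standard Gumbel noise
    return [-math.log(-math.log(rng.random())) for _ in range(n)]

def gumbel_max_sample(logits, rng):
    """Gumbel-max trick: argmax over (logit + Gumbel noise) is an
    exact sample from softmax(logits)."""
    g = sample_gumbel(rng, len(logits))
    scores = [l + n for l, n in zip(logits, g)]
    return scores.index(max(scores)), g

def naive_argmax_inverse(logits, token, rng, margin=1e-3):
    """Naive pseudo-inverse of argmax sampling (hypothetical helper,
    not the paper's LAI): draw fresh Gumbel noise, then shift the
    coordinate of the observed `token` just enough that
    argmax(logits + g) reproduces it exactly."""
    g = sample_gumbel(rng, len(logits))
    scores = [l + n for l, n in zip(logits, g)]
    best_other = max(s for i, s in enumerate(scores) if i != token)
    if scores[token] <= best_other:
        g[token] += best_other - scores[token] + margin
    return g

rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(8)]
token, _ = gumbel_max_sample(logits, rng)        # "source" token
g_inv = naive_argmax_inverse(logits, token, rng)  # inverse noise
scores = [l + n for l, n in zip(logits, g_inv)]
assert scores.index(max(scores)) == token        # exact reconstruction
```

In an editing pipeline, such inverse noise would be held fixed while the text-conditioned logits change, so tokens whose logits are unaffected by the new prompt are reproduced verbatim (preserving background) while prompt-relevant tokens can flip.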