Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Abstract

Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models on text-to-image generation. While conditional generation has been widely explored, prompt-guided image editing without additional training is equally important, because it supports many practical applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise-inversion-based editing technique designed explicitly for VAR models. VARIN uses a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and support targeted, controllable edits aligned with text prompts. Extensive experiments show that VARIN modifies source images according to the given prompts while largely preserving the original background and structural details, which validates it as a practical editing approach.
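To make the idea of inverting argmax sampling concrete, the sketch below uses the standard Gumbel-max trick, in which a token is chosen as the argmax of logits plus Gumbel noise, and then builds a noise vector under which a given observed token is the argmax. It is a minimal illustration under stated assumptions and is not the paper's LAI: the function names (`sample_gumbel`, `gumbel_argmax`, `naive_inverse_noise`), the toy logits, and the shift-by-margin rule are all hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_argmax(logits, noise):
    """Gumbel-max sampling: index of the largest noise-perturbed logit."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def naive_inverse_noise(logits, token, rng, margin=1e-3):
    """Return noise under which gumbel_argmax(logits, noise) == token.

    Hypothetical stand-in for a pseudo-inverse of argmax sampling;
    NOT the paper's Location-aware Argmax Inversion (LAI).
    Requires at least two logits.
    """
    noise = [sample_gumbel(rng) for _ in logits]
    # Largest perturbed score among all tokens except the observed one.
    best_other = max(
        l + g for i, (l, g) in enumerate(zip(logits, noise)) if i != token
    )
    # If the observed token does not already win, lift its noise just enough.
    if logits[token] + noise[token] <= best_other:
        noise[token] = best_other - logits[token] + margin
    return noise


rng = random.Random(0)
source_logits = [2.0, 0.5, -1.0, 0.1]  # toy logits under the source prompt
observed_token = 2                     # token found in the encoded source image

noise = naive_inverse_noise(source_logits, observed_token, rng)
# Reconstruction: replaying the recovered noise reproduces the observed token.
assert gumbel_argmax(source_logits, noise) == observed_token

# Editing: reuse the same noise with logits from a different (target) prompt.
target_logits = [0.2, 3.0, -1.0, 0.1]  # toy logits, assumed for illustration
edited_token = gumbel_argmax(target_logits, noise)
print("edited token index:", edited_token)
```

Reusing the recovered noise with the source logits reproduces the source token by construction, while swapping in target-prompt logits lets the output change only where the new logits outweigh the stored noise. This is the general intuition behind noise-inversion editing. How LAI chooses the noise is not specified in the abstract.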
