Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Abstract

Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
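
The abstract does not define LAI, so the following is not the authors' method. It is only a minimal sketch of the underlying idea, assuming tokens are drawn by argmax over logits plus Gumbel noise (the Gumbel-max trick). It uses a standard construction for drawing Gumbel noise conditioned on a known argmax; all function names (`sample_gumbel`, `invert_noise`, `gumbel_max_sample`) and the toy logits are hypothetical. Replaying the recovered noise with the same logits reproduces the source token, while replaying it with different logits (standing in for an edited prompt) may select a different token.

```python
import math
import random


def sample_gumbel(rng, loc=0.0):
    """Draw one Gumbel(loc, 1) sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() returns values in [0, 1); avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def log_add_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def gumbel_max_sample(logits, noise):
    """Gumbel-max trick: return argmax_i (logits[i] + noise[i])."""
    perturbed = [l + g for l, g in zip(logits, noise)]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def invert_noise(logits, token, rng):
    """Sample Gumbel noise under which argmax sampling yields `token`.

    Illustrative only: this is the generic conditional-Gumbel construction,
    not the Location-aware Argmax Inversion named in the abstract.
    """
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # The maximum perturbed logit is distributed as Gumbel(logsumexp(logits)).
    top = sample_gumbel(rng, log_z)
    noise = []
    for i, logit in enumerate(logits):
        if i == token:
            value = top  # the observed token holds the maximum
        else:
            # Truncate a Gumbel(logit) sample so that it falls below `top`.
            g = sample_gumbel(rng, logit)
            value = -log_add_exp(-top, -g)
        noise.append(value - logit)  # subtract the logit to get pure noise
    return noise


rng = random.Random(0)
source_logits = [2.0, -1.0, 0.5, 0.0]  # toy logits for the source prompt
source_token = 2                       # token observed in the source image
noise = invert_noise(source_logits, source_token, rng)

# Same logits + recovered noise: the source token is reconstructed.
reconstructed = gumbel_max_sample(source_logits, noise)

# Different logits (a stand-in for an edited prompt) + the same noise.
edited_logits = [0.0, 3.0, 0.5, 0.0]
edited = gumbel_max_sample(edited_logits, noise)
print(reconstructed == source_token, edited)
```

In this toy setting the recovered noise is what ties the edit to the source: positions whose logits barely change under the new prompt tend to keep their original token, which is consistent with the background preservation the abstract reports.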
