Consolidating Attention Features for Multi-view Image Editing
February 22, 2024
Authors: Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre
cs.AI
Abstract
Large-scale text-to-image models enable a wide range of image editing
techniques, using text prompts or even spatial controls. However, applying
these editing methods to multi-view images depicting a single scene leads to
3D-inconsistent results. In this work, we focus on spatial control-based
geometric manipulations and introduce a method to consolidate the editing
process across various views. We build on two insights: (1) maintaining
consistent features throughout the generative process helps attain consistency
in multi-view editing, and (2) the queries in self-attention layers
significantly influence the image structure. Hence, we propose to improve the
geometric consistency of the edited images by enforcing the consistency of the
queries. To do so, we introduce QNeRF, a neural radiance field trained on the
internal query features of the edited images. Once trained, QNeRF can render
3D-consistent queries, which are then softly injected back into the
self-attention layers during generation, greatly improving multi-view
consistency. We refine the process through a progressive, iterative method that
better consolidates queries across the diffusion timesteps. We compare our
method to a range of existing techniques and demonstrate that it can achieve
better multi-view consistency and higher fidelity to the input scene. These
advantages allow us to train NeRFs that have fewer visual artifacts and are
better aligned with the target geometry.
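
The abstract does not specify how the rendered queries are "softly injected" into the self-attention layers. As a rough illustration only, the sketch below assumes one possible reading: a linear blend between the queries produced during generation and the 3D-consistent queries rendered by QNeRF, followed by standard scaled dot-product attention. The function names, the `alpha` blending weight, and the blending rule itself are hypothetical and are not taken from the paper.

```python
import math


def blend_queries(q_generated, q_rendered, alpha):
    """Linearly interpolate per-token query vectors.

    alpha = 0 keeps the generated queries, alpha = 1 uses only the rendered ones.
    The linear blend is an assumption made for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [
        [(1.0 - alpha) * g + alpha * r for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(q_generated, q_rendered)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    output = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        output.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_out)]
        )
    return output


def attention_with_soft_query_injection(q_generated, q_rendered, keys, values, alpha=0.5):
    """Run self-attention with blended queries (hypothetical injection rule)."""
    blended = blend_queries(q_generated, q_rendered, alpha)
    return self_attention(blended, keys, values)
```

In this sketch, an `alpha` strictly between 0 and 1 pulls each view's queries toward the shared, rendered ones without fully overriding what the diffusion model generated for that view, which is one way to read "soft" in this context.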