Consolidating Attention Features for Multi-view Image Editing

Abstract

Large-scale text-to-image models enable a wide range of image editing techniques driven by text prompts or even spatial controls. However, applying these editing methods to multi-view images of a single scene produces 3D-inconsistent results. In this work, we focus on geometric manipulations based on spatial control and introduce a method that consolidates the editing process across views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it achieves better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs that have fewer visual artifacts and are better aligned with the target geometry.
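To make the idea of soft query injection concrete, here is a minimal sketch. The abstract does not state how rendered queries are combined with a layer's own queries, so the linear blend below and its weight `alpha` are assumptions made for illustration, as are the function names `blend_queries` and `self_attention`. The attention itself is ordinary scaled dot-product attention, written in pure Python for small toy inputs.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def blend_queries(q_model, q_rendered, alpha):
    """Assumed soft injection: linear interpolation between two query sets.

    alpha = 0 keeps the layer's own queries; alpha = 1 replaces them
    entirely with the rendered (3D-consistent) queries.
    """
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(row_m, row_r)]
        for row_m, row_r in zip(q_model, q_rendered)
    ]


def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q_i in q:
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)
        out.append([
            sum(w * v_j[c] for w, v_j in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


# Toy example: two tokens with 2-dimensional features (hypothetical values).
q_model = [[1.0, 0.0], [0.0, 1.0]]      # queries computed by the layer
q_rendered = [[0.5, 0.5], [0.5, 0.5]]   # stand-in for rendered queries
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

q_soft = blend_queries(q_model, q_rendered, alpha=0.5)
edited = self_attention(q_soft, k, v)
```

With `alpha` strictly between 0 and 1, the layer keeps part of its own query signal while being pulled toward the shared, view-consistent one. The method described above may realize "soft" injection differently, so the blend should be read only as one plausible instance.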
