Consolidating Attention Features for Multi-view Image Editing

Abstract

Large-scale text-to-image models enable a wide range of image editing techniques, driven by text prompts or spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it achieves better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs that have fewer visual artifacts and are better aligned with the target geometry.
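The abstract does not spell out how rendered queries are "softly injected" into self-attention. As a rough illustration only, the sketch below assumes the simplest reading: a linear blend between the queries produced during generation and the 3D-consistent queries rendered for the same view, followed by standard scaled dot-product attention. The function name `self_attention_with_soft_query`, the blend weight `alpha`, and the tensor shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_with_soft_query(q_gen, q_rendered, k, v, alpha=0.5):
    """Single-head self-attention with a blended query (illustrative assumption).

    q_gen      : (tokens, dim) queries computed by the diffusion model itself
    q_rendered : (tokens, dim) queries rendered for the same view
                 (stand-in for the output of a trained QNeRF)
    k          : (tokens, dim) keys
    v          : (tokens, dim_v) values
    alpha      : hypothetical blend weight; 0 keeps the generated queries,
                 1 fully replaces them with the rendered ones
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumed form of "soft" injection: a convex combination of the two queries.
    q = (1.0 - alpha) * q_gen + alpha * q_rendered
    # Standard scaled dot-product attention on the blended queries.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

With `alpha=0.0` this reduces to ordinary self-attention. Larger values pull the attention maps, and so the image structure, toward the rendered queries. The schedule the authors actually use across layers and diffusion timesteps is not given here.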
