Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models
September 14, 2023
Authors: James Burgess, Kuan-Chieh Wang, Serena Yeung
cs.AI
Abstract
Text-to-image diffusion models understand spatial relationships between
objects, but do they represent the true 3D structure of the world from only 2D
supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image
diffusion models like Stable Diffusion, and we show that this structure can be
exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion
(ViewNeTI), controls the 3D viewpoint of objects in generated images from
frozen diffusion models. We train a small neural mapper to take camera
viewpoint parameters and predict text encoder latents; the latents then
condition the diffusion generation process to produce images with the desired
camera viewpoint.
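
To make the mapper concrete, below is a minimal PyTorch sketch of a small network that maps camera viewpoint parameters to a latent in the text encoder's token-embedding space. The dimensions, activation choices, and camera parameterization are illustrative assumptions, not the paper's exact architecture; 768 is assumed to match the CLIP text-embedding width used by Stable Diffusion 1.x.

```python
import torch
import torch.nn as nn

class ViewpointMapper(nn.Module):
    """Toy sketch of a small neural mapper: camera viewpoint parameters
    -> a latent vector in the text encoder's token-embedding space.
    Sizes and layers are assumptions, not the paper's exact design."""
    def __init__(self, cam_dim: int = 12, hidden_dim: int = 128,
                 text_embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, text_embed_dim),
        )

    def forward(self, cam_params: torch.Tensor) -> torch.Tensor:
        # cam_params: (batch, cam_dim), e.g. a flattened camera-to-world
        # matrix. The output is used as a conditioning latent for the
        # frozen diffusion model's text encoder.
        return self.net(cam_params)

# Usage: predict a conditioning latent for one camera pose.
mapper = ViewpointMapper()
cam = torch.randn(1, 12)    # placeholder viewpoint parameters
view_token = mapper(cam)    # (1, 768) latent for the text encoder
```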
ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the
frozen diffusion model as a prior, we can solve NVS with very few input views;
we can even do single-view novel view synthesis. Our single-view NVS
predictions have good semantic details and photorealism compared to prior
methods. Our approach is well suited for modeling the uncertainty inherent in
sparse 3D vision problems because it can efficiently generate diverse samples.
Our view-control mechanism is general and can even change the camera view in
images generated by user-defined prompts.
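
As a rough illustration of where such a latent could enter a frozen Stable Diffusion pipeline, the sketch below writes the mapper's output into the embedding of a placeholder token, in the style of classic textual inversion. This is an approximation, not the paper's implementation: ViewNeTI predicts latents with a neural mapper rather than storing one fixed embedding, and the `<view>` token, checkpoint name, and the `mapper`/`cam`/`view_token` variables carried over from the sketch above are all assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a frozen Stable Diffusion pipeline (checkpoint name is illustrative).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Register a placeholder token whose embedding we overwrite with the
# mapper's predicted latent (static textual-inversion approximation).
pipe.tokenizer.add_tokens(["<view>"])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids("<view>")

weight = pipe.text_encoder.get_input_embeddings().weight
with torch.no_grad():
    # view_token comes from the ViewpointMapper sketch above; shape (1, 768).
    weight[token_id] = view_token[0].to(weight.dtype)

# The placeholder token now steers the viewpoint of the generated image.
image = pipe("a photo of a chair, <view>").images[0]
image.save("chair_view.png")
```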