Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

Abstract

Text-to-image diffusion models understand the spatial relationships between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in images generated by frozen diffusion models. We train a small neural mapper that takes camera viewpoint parameters and predicts text encoder latents; these latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions compare favorably with prior methods in semantic detail and photorealism. Our approach is well suited to modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated from user-defined prompts.
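The mapper described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `ViewMapper` class, the `fourier_features` helper, the three-value camera parameterization (azimuth, elevation, radius), the layer sizes, and the 16-dimensional output are all hypothetical, and the weights are random rather than trained. The sketch only shows the data flow from camera parameters to a conditioning vector; in the method itself, the mapper would be trained while the diffusion model stays frozen.

```python
import math
import random


def fourier_features(params, num_freqs=4):
    """Encode each camera parameter with sin/cos at several frequencies."""
    feats = []
    for p in params:
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * p))
            feats.append(math.cos((2 ** k) * p))
    return feats


class ViewMapper:
    """Hypothetical two-layer MLP: camera parameters -> one conditioning vector.

    Assumption: the output stands in for a text-encoder latent. Its size
    (latent_dim) is arbitrary here and chosen small for readability.
    """

    def __init__(self, num_params=3, num_freqs=4, hidden=32, latent_dim=16, seed=0):
        rng = random.Random(seed)  # fixed seed so the sketch is reproducible
        in_dim = 2 * num_freqs * num_params
        self.num_freqs = num_freqs
        self.w1 = [[rng.gauss(0, 1 / math.sqrt(in_dim)) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0, 1 / math.sqrt(hidden)) for _ in range(hidden)]
                   for _ in range(latent_dim)]
        self.b2 = [0.0] * latent_dim

    def __call__(self, camera_params):
        x = fourier_features(camera_params, self.num_freqs)
        # Hidden layer with tanh activation.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        # Linear output layer producing the conditioning vector.
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]


mapper = ViewMapper()
front = mapper([0.0, 0.3, 1.5])          # azimuth, elevation (radians), radius
side = mapper([math.pi / 2, 0.3, 1.5])   # same object, camera rotated 90 degrees
```

In this sketch, different camera poses yield different vectors, which is the property a trained mapper would rely on to steer the frozen generator toward a requested viewpoint.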

