Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model
September 7, 2023
Authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Recent advances in diffusion models such as ControlNet have enabled
geometrically controllable, high-fidelity text-to-image generation. However,
none of them addresses the question of adding such controllability to
text-to-3D generation. In response, we propose Text2Control3D, a controllable
text-to-3D avatar generation method whose facial expression is controllable
given a monocular video casually captured with a hand-held camera. Our main
strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF)
optimized with a set of controlled viewpoint-aware images that we generate from
ControlNet, whose condition input is the depth map extracted from the input
video. When generating the viewpoint-aware images, we utilize cross-reference
attention to inject well-controlled, referential facial expression and
appearance via cross attention. We also conduct low-pass filtering of Gaussian
latent of the diffusion model in order to ameliorate the viewpoint-agnostic
texture problem we observed from our empirical analysis, where the
viewpoint-aware images contain identical textures on identical pixel positions
that are incomprehensible in 3D. Finally, to train NeRF with the images that
are viewpoint-aware yet are not strictly consistent in geometry, our approach
considers per-image geometric variation as a deformation from a shared
3D canonical space. Consequently, we construct the 3D avatar in a canonical
space of deformable NeRF by learning a set of per-image deformations via a
deformation field table. We present empirical results and discuss the
effectiveness of our method.
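
To make the latent low-pass filtering step mentioned in the abstract more concrete, below is a minimal PyTorch sketch of filtering a Gaussian diffusion latent in the frequency domain. The cutoff value, the circular mask, and the tensor shapes are illustrative assumptions; the paper's exact filtering scheme may differ.

```python
import torch

def lowpass_filter_latent(latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Suppress high-frequency components of a Gaussian latent (B, C, H, W)
    with a 2D FFT, keeping frequencies below `cutoff` (fraction of Nyquist).
    Illustrative sketch only; the paper's filtering scheme may differ."""
    _, _, h, w = latent.shape
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    # Centered circular low-pass mask in normalized frequency coordinates.
    fy = torch.linspace(-0.5, 0.5, h).view(-1, 1)
    fx = torch.linspace(-0.5, 0.5, w).view(1, -1)
    mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff * 0.5).to(latent.dtype)
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return filtered

# Hypothetical usage: filter the initial noise before ControlNet sampling.
z = torch.randn(1, 4, 64, 64)
z_lp = lowpass_filter_latent(z, cutoff=0.25)
```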
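Likewise, a hedged sketch of what a per-image deformation field table could look like: a learned code per training image plus a shared MLP that warps sampled points into the shared canonical NeRF space. The class name `DeformationFieldTable`, code dimension, and network depth are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DeformationFieldTable(nn.Module):
    """Per-image deformation: a table of learned latent codes plus a shared MLP
    mapping (point, code) -> 3D offset into a shared canonical space (sketch)."""

    def __init__(self, num_images: int, code_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.codes = nn.Embedding(num_images, code_dim)  # one code per training image
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted 3D offset
        )

    def forward(self, x: torch.Tensor, image_idx: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) sample points along rays; image_idx: (N,) source-image indices
        code = self.codes(image_idx)                  # (N, code_dim)
        offset = self.mlp(torch.cat([x, code], -1))   # (N, 3)
        return x + offset  # points warped into the shared canonical space

# Hypothetical usage: query the canonical NeRF at the deformed points, e.g.
# sigma, rgb = canonical_nerf(deform(x, image_idx), view_dirs)
```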