Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model
September 7, 2023
Authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Recent advances in diffusion models such as ControlNet have enabled
geometrically controllable, high-fidelity text-to-image generation. However,
none of them addresses the question of adding such controllability to
text-to-3D generation. In response, we propose Text2Control3D, a controllable
text-to-3D avatar generation method whose facial expression is controllable
given a monocular video casually captured with a hand-held camera. Our main
strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF)
optimized with a set of controlled viewpoint-aware images that we generate from
ControlNet, whose condition input is the depth map extracted from the input
video. When generating the viewpoint-aware images, we utilize cross-reference
attention to inject well-controlled, referential facial expression and
appearance via cross attention. We also conduct low-pass filtering of Gaussian
latent of the diffusion model in order to ameliorate the viewpoint-agnostic
texture problem we observed from our empirical analysis, where the
viewpoint-aware images contain identical textures on identical pixel positions
that are incomprehensible in 3D. Finally, to train NeRF with the images that
are viewpoint-aware yet are not strictly consistent in geometry, our approach
considers per-image geometric variation as a deformation from a shared
3D canonical space. Consequently, we construct the 3D avatar in a canonical
space of deformable NeRF by learning a set of per-image deformations via a
deformation field table. We present empirical results and discuss the
effectiveness of our method.
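
To make the latent low-pass filtering step mentioned in the abstract more concrete, below is a minimal PyTorch sketch of filtering a Gaussian diffusion latent in the frequency domain. The cutoff value, the circular mask, and the tensor shapes are illustrative assumptions; the paper's exact filtering scheme may differ.

```python
import torch

def lowpass_filter_latent(latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Suppress high-frequency components of a Gaussian latent (B, C, H, W)
    with a 2D FFT, keeping frequencies below `cutoff` (fraction of Nyquist).
    Illustrative sketch only; the paper's filtering scheme may differ."""
    _, _, h, w = latent.shape
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    # Centered circular low-pass mask in normalized frequency coordinates.
    fy = torch.linspace(-0.5, 0.5, h).view(-1, 1)
    fx = torch.linspace(-0.5, 0.5, w).view(1, -1)
    mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff * 0.5).to(latent.dtype)
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return filtered

# Hypothetical usage: filter the initial noise before ControlNet sampling.
z = torch.randn(1, 4, 64, 64)
z_lp = lowpass_filter_latent(z, cutoff=0.25)
```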
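Likewise, a hedged sketch of what a per-image deformation field table could look like: a learned code per training image plus a shared MLP that warps sampled points into the shared canonical NeRF space. The class name `DeformationFieldTable`, code dimension, and network depth are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DeformationFieldTable(nn.Module):
    """Per-image deformation: a table of learned latent codes plus a shared MLP
    mapping (point, code) -> 3D offset into a shared canonical space (sketch)."""

    def __init__(self, num_images: int, code_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.codes = nn.Embedding(num_images, code_dim)  # one code per training image
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted 3D offset
        )

    def forward(self, x: torch.Tensor, image_idx: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) sample points along rays; image_idx: (N,) source-image indices
        code = self.codes(image_idx)                  # (N, code_dim)
        offset = self.mlp(torch.cat([x, code], -1))   # (N, 3)
        return x + offset  # points warped into the shared canonical space

# Hypothetical usage: query the canonical NeRF at the deformed points, e.g.
# sigma, rgb = canonical_nerf(deform(x, image_idx), view_dirs)
```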