Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

Abstract

Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of these works addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a text-to-3D avatar generation method in which the avatar's facial expression is controllable given a monocular video casually captured with a hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF), optimized with a set of controlled, viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we use cross-reference attention to inject a well-controlled, referential facial expression and appearance via cross attention. We also apply low-pass filtering to the Gaussian latent of the diffusion model to ameliorate the viewpoint-agnostic texture problem observed in our empirical analysis, in which the viewpoint-aware images contain identical textures at identical pixel positions, which is incomprehensible in 3D. Finally, to train NeRF with images that are viewpoint-aware yet not strictly consistent in geometry, our approach treats per-image geometric variation as a deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in the canonical space of a deformable NeRF by learning a set of per-image deformations via a deformation field table. We present empirical results and discuss the effectiveness of our method.
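As a rough illustration of the low-pass filtering step mentioned in the abstract, the following minimal sketch samples a Gaussian latent and suppresses its high spatial frequencies. The abstract does not specify the filter, so everything here is an assumption: the ideal circular mask in the 2D Fourier domain, the function name `low_pass_gaussian_latent`, the `cutoff` parameter, and the per-channel re-normalization are illustrative choices, not the authors' implementation.

```python
import numpy as np


def low_pass_gaussian_latent(shape, cutoff=0.25, seed=0):
    """Sample Gaussian noise of shape (C, H, W) and low-pass filter it.

    `cutoff` is a hypothetical parameter: the fraction of the Nyquist
    frequency (0.5 cycles per sample) that is kept.
    Returns both the raw latent and the filtered latent.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)

    _, h, w = shape
    # Frequency coordinates in cycles per sample, in the FFT's own ordering.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)

    # Ideal (brick-wall) circular low-pass mask; an assumed filter shape.
    mask = (radius <= cutoff * 0.5).astype(latent.dtype)

    # Filter each channel independently in the 2D Fourier domain.
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    filtered = np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real

    # Re-normalize each channel to zero mean and unit variance so the result
    # keeps the scale of a standard Gaussian latent (an assumption).
    filtered -= filtered.mean(axis=(-2, -1), keepdims=True)
    filtered /= filtered.std(axis=(-2, -1), keepdims=True)
    return latent, filtered


if __name__ == "__main__":
    raw, smooth = low_pass_gaussian_latent((4, 64, 64), cutoff=0.25)
    print(raw.shape, smooth.shape)
```

In a diffusion pipeline, a latent filtered this way would stand in for the initial noise; how the cutoff should be chosen, and whether re-normalization is appropriate, depends on details the abstract does not give.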
