

Instant Multi-View Head Capture through Learnable Registration

June 12, 2023
Authors: Timo Bolkart, Tianye Li, Michael J. Black
cs.AI

Abstract

Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow, and commonly address the problem in two separate steps: multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scan surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.
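The core of the multi-view inference described above is a volumetric feature representation: 3D grid points are projected into each calibrated view, per-view features are sampled at the projected locations, and the results are fused across views. The sketch below illustrates that idea with a minimal pinhole projection and bilinear sampling in NumPy; all function names and the simple mean fusion are illustrative assumptions, not the paper's actual architecture (which uses learned, view- and surface-aware fusion).

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world points to pixel coordinates with a pinhole camera.

    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation (world -> camera).
    """
    cam = points @ R.T + t          # transform into the camera frame
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide -> pixel coords

def sample_bilinear(feat, uv):
    """Bilinearly sample an HxWxC feature map at Nx2 pixel locations."""
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1.001)
    v = np.clip(uv[:, 1], 0, h - 1.001)
    u0, v0 = u.astype(int), v.astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return (feat[v0, u0] * (1 - du) * (1 - dv)
            + feat[v0, u0 + 1] * du * (1 - dv)
            + feat[v0 + 1, u0] * (1 - du) * dv
            + feat[v0 + 1, u0 + 1] * du * dv)

def fuse_volume_features(grid_points, feats, cams):
    """Sample each view's feature map at the projected grid points and average.

    feats: list of HxWxC per-view feature maps; cams: list of (K, R, t) tuples.
    """
    sampled = [sample_bilinear(f, project_points(grid_points, *cam))
               for f, cam in zip(feats, cams)]
    return np.mean(sampled, axis=0)  # placeholder for learned fusion
```

A downstream network would then decode such fused per-voxel features into a mesh in dense correspondence; here the mean stands in for TEMPEH's learned fusion purely to keep the sketch self-contained.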