
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

March 13, 2026
Authors: Sangmin Kim, Minhyuk Hwang, Geonho Cha, Dongyoon Wee, Jaesik Park
cs.AI

Abstract

Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to resolve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
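The abstract contrasts geometry-based multi-person association with appearance-based matching. CHROMM's actual association method is not detailed here; as a loose illustration of the general idea, the sketch below greedily matches people across two views by the distance between their estimated 3D root positions in a shared world frame. The function name, the greedy nearest-neighbor strategy, and the 0.5 m threshold are all hypothetical assumptions, not the paper's algorithm.

```python
def associate_by_geometry(roots_a, roots_b, max_dist=0.5):
    """Greedily match 3D root positions (in metres) between two views.

    roots_a, roots_b: lists of (x, y, z) tuples in a shared world frame.
    Returns a list of (i, j) index pairs whose distance is <= max_dist.
    Hypothetical sketch; not the method described in the paper.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    pairs = []
    used_b = set()
    for i, p in enumerate(roots_a):
        # Pick the nearest not-yet-matched candidate in view B.
        best_j, best_d = None, max_dist
        for j, q in enumerate(roots_b):
            if j in used_b:
                continue
            d = dist(p, q)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
            used_b.add(best_j)
    return pairs


# Two people seen from two views; indices differ per view, but the
# 3D roots line up, so geometry alone recovers the correspondence.
view_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
view_b = [(2.1, 0.0, 0.0), (0.05, 0.0, 0.0)]
print(associate_by_geometry(view_a, view_b))  # [(0, 1), (1, 0)]
```

Matching on 3D positions sidesteps the failure mode of appearance-based association, where similar clothing or occlusion confuses re-identification features across views.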