ChatPaper.ai


Human3R: Everyone Everywhere All at Once

October 7, 2025
Authors: Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll
cs.AI

Abstract

We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, striving to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline that can be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R
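The abstract describes adapting the frozen CUT3R backbone via parameter-efficient visual prompt tuning, with SMPL-X bodies read out directly from the adapted features. Below is a minimal, hypothetical sketch of that general idea: a small set of learnable prompt tokens is prepended to frozen backbone tokens, and a lightweight head reads parameters out of the prompt slots. All names, dimensions, and the linear readout are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of parameter-efficient visual prompt tuning: the pretrained
# backbone stays frozen; only prompt tokens and a small readout head train.
import numpy as np

rng = np.random.default_rng(0)

D = 64        # token embedding dimension (assumed)
N_IMG = 196   # frozen image tokens per frame (assumed)
N_PROMPT = 8  # learnable prompt tokens -- the only tuned input parameters

# Frozen backbone tokens for one frame (stand-in for CUT3R features).
image_tokens = rng.standard_normal((N_IMG, D))

# Trainable prompt tokens, initialized near zero.
prompt_tokens = 0.02 * rng.standard_normal((N_PROMPT, D))

# Prompt tuning: prepend prompts to the frozen token sequence before the
# (frozen) transformer blocks; gradients would flow only into the prompts.
tokens = np.concatenate([prompt_tokens, image_tokens], axis=0)

# A lightweight trainable head reads body parameters out of the prompt
# slots (a single linear map here as a placeholder for an SMPL-X head).
W_head = 0.02 * rng.standard_normal((D, 10))  # 10 = placeholder param count
readout = tokens[:N_PROMPT] @ W_head

print(tokens.shape)   # (204, 64): 8 prompt tokens + 196 image tokens
print(readout.shape)  # (8, 10): one parameter vector per prompt slot
```

The design choice this illustrates is why prompt tuning is parameter-efficient: only the `N_PROMPT * D` prompt entries and the head weights are optimized, so the backbone's pretrained spatiotemporal priors are left untouched.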