VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
June 11, 2026
Authors: Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany
cs.AI
Abstract
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these sequences are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against the accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54); on the real-video datasets Fit3D and NBA, it learns to generate motions that humans consistently prefer, with strong quantitative results.
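To give a rough feel for why weighting a 2D reprojection error by depth can stand in for 3D supervision, the sketch below uses a simple pinhole camera. The abstract does not state the loss formula, so everything here is an assumption for illustration: the function names `project` and `depth_weighted_reprojection_loss`, the `focal` parameter, and the particular weighting `z / focal` are hypothetical and may differ from the paper's actual formulation. Under a pinhole model, a lateral 3D offset at depth z shrinks by a factor of focal / z on the image plane, so multiplying the 2D residual by z / focal recovers the lateral 3D error.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space points (..., 3) to the image plane (..., 2)."""
    z = points_3d[..., 2:3]
    return focal * points_3d[..., :2] / z

def depth_weighted_reprojection_loss(pred_3d, keypoints_2d, focal=1.0):
    """Hypothetical depth-weighted 2D loss (an assumed form, not the paper's definition).

    The 2D residual of each joint is scaled by its predicted depth divided by
    the focal length, which undoes the perspective shrinkage of lateral errors.
    """
    z = pred_3d[..., 2:3]
    residual_2d = project(pred_3d, focal) - keypoints_2d
    weighted = (z / focal) * residual_2d
    return np.mean(np.sum(weighted ** 2, axis=-1))

# Toy check: two joints at different depths, prediction off only laterally.
gt_3d = np.array([[[0.1, 0.2, 3.0], [-0.3, 0.4, 5.0]]])
offset = np.array([[[0.05, -0.02, 0.0], [0.01, 0.03, 0.0]]])
pred_3d = gt_3d + offset

loss_2d = depth_weighted_reprojection_loss(pred_3d, project(gt_3d, 2.0), focal=2.0)
loss_3d_lateral = np.mean(np.sum(offset[..., :2] ** 2, axis=-1))
```

In this toy case the prediction has the correct depth, so the depth-weighted 2D loss matches the lateral 3D squared error exactly. Errors along the depth axis are not covered by this simple identity; the abstract's equivalence claim holds in expectation and under assumptions that this sketch does not model.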