VideoMDM: Towards 3D Human Motion Generation from 2D Supervision

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these sequences are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against the accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt two standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54). On the real-video datasets Fit3D and NBA, the method learns to generate motions that humans consistently prefer, with strong quantitative results.
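
To make the supervision signal described in the abstract more concrete, the following is a minimal NumPy sketch of a depth-weighted 2D reprojection loss. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact weighting, so the sketch assumes a pinhole camera and scales each pixel residual by depth / focal length, which converts an image-plane error back into a metric lateral error. The function names, the focal length, and the principal point are all hypothetical.

```python
import numpy as np


def project(points_3d, focal=1000.0, center=(0.0, 0.0)):
    """Pinhole projection of camera-space points (..., J, 3) to pixels (..., J, 2).

    Assumes every point lies in front of the camera (z > 0).
    """
    x, y, z = points_3d[..., 0], points_3d[..., 1], points_3d[..., 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=-1)


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1000.0, center=(0.0, 0.0)):
    """Mean squared 2D residual, with each joint's residual scaled by depth / focal.

    ASSUMPTION: this particular weighting is one plausible reading of
    "depth-weighted"; the source does not specify the formula.
    """
    pred_2d = project(pred_3d, focal, center)
    residual_px = pred_2d - target_2d          # error in pixels
    depth = pred_3d[..., 2:3]                  # predicted depth per joint
    residual_metric = residual_px * depth / focal
    return float(np.mean(np.sum(residual_metric ** 2, axis=-1)))


# Toy example: two joints at different depths, both off by 0.1 units laterally.
gt_3d = np.array([[[0.2, -0.1, 2.0], [0.0, 0.3, 5.0]]])   # shape (1, 2, 3)
pred_3d = gt_3d + np.array([0.1, 0.0, 0.0])
keypoints_2d = project(gt_3d)                              # stands in for detected 2D keypoints
loss = depth_weighted_reprojection_loss(pred_3d, keypoints_2d)
print(loss)
```

In this toy example both joints contribute the same squared error (about 0.01) even though they sit at different depths, which is the intuition behind weighting by depth: an unweighted pixel loss would penalize the nearer joint far more for the same 3D mistake.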