VideoMDM: Towards 3D Human Motion Generation from 2D Supervision

Abstract

We introduce VideoMDM, a diffusion-based framework that learns 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these sequences are diffused, the model denoises them in 3D, and the prediction is reprojected and compared against the accurate 2D keypoints, so that supervision takes place in 2D. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54). On the real-video datasets Fit3D and NBA, it learns to generate motions that humans consistently prefer, with strong quantitative results.
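To make the idea of a depth-weighted 2D reprojection loss concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give the exact weighting, so the sketch assumes a pinhole camera with focal length `focal`, joints expressed in camera space with positive depth, and a per-joint weight of (z / focal)^2 computed from the predicted depth. The names `project` and `depth_weighted_reprojection_loss` are hypothetical. The intuition is that a lateral 3D offset of size d at depth z appears in the image as roughly focal * d / z, so scaling the squared 2D error by (z / focal)^2 brings it back to 3D units.

```python
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(joint: Point3D, focal: float = 1.0) -> Point2D:
    """Pinhole projection of a camera-space joint (x, y, z), assuming z > 0."""
    x, y, z = joint
    return (focal * x / z, focal * y / z)


def depth_weighted_reprojection_loss(
    pred_3d: Sequence[Point3D],
    target_2d: Sequence[Point2D],
    focal: float = 1.0,
) -> float:
    """Mean squared 2D reprojection error with an assumed depth weighting.

    Each joint's squared image-plane error is scaled by (z / focal)^2,
    where z is the predicted depth. This weighting is an illustrative
    assumption, not the formula from the paper.
    """
    total = 0.0
    for (x, y, z), (u_target, v_target) in zip(pred_3d, target_2d):
        u, v = project((x, y, z), focal)
        weight = (z / focal) ** 2  # undo the 1/z shrinkage of the projection
        total += weight * ((u - u_target) ** 2 + (v - v_target) ** 2)
    return total / len(pred_3d)


if __name__ == "__main__":
    # Prediction is off by 0.1 units laterally from the true joint at depth 2.
    true_joint = (0.0, 0.0, 2.0)
    predicted = [(0.1, 0.0, 2.0)]
    keypoints = [project(true_joint)]
    print(depth_weighted_reprojection_loss(predicted, keypoints))
```

In this toy case the weighted 2D loss matches the squared lateral 3D error (0.1 squared, up to floating-point rounding), whereas the unweighted 2D error would shrink as the joint moves farther from the camera. Depth errors along the viewing ray are not visible to this loss for a single joint, which is one reason the abstract's equivalence is stated only in expectation and under assumptions.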